I load "page_titles" and "page_ids" from a file and put them in a
dictionary. One option I haven't used yet would be putting them into a
database table and INNER JOINing it with the pagelinks table to obtain
only the links for those articles. Still, if the list is 300,000
articles, even though that is just 20% of the database, it is still a
lot.
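Roughly, the join could work like this; a minimal, untested sketch
(assuming pymysql; the database name s12345__mydb, the table name
my_pages, and the replica host are placeholders, and I am also assuming
the user database sits on the same server as the replica so the
cross-database join is allowed):

    import pymysql

    # Placeholder data: in reality, the 300,000 ids loaded from my file.
    page_ids = [15580374, 15580375]

    conn = pymysql.connect(read_default_file="~/.my.cnf",
                           host="enwiki.labsdb",  # placeholder host
                           db="s12345__mydb")     # placeholder user db
    cur = conn.cursor()

    # Staging table holding just my page ids.
    cur.execute("CREATE TABLE IF NOT EXISTS my_pages "
                "(page_id INT UNSIGNED PRIMARY KEY)")
    cur.executemany("INSERT IGNORE INTO my_pages (page_id) VALUES (%s)",
                    [(pid,) for pid in page_ids])
    conn.commit()

    # One join instead of one query per article: outlinks for the whole
    # set. (For inlinks, join pl_namespace/pl_title against page instead.)
    cur.execute("SELECT pl.pl_from, pl.pl_namespace, pl.pl_title "
                "FROM enwiki_p.pagelinks AS pl "
                "INNER JOIN my_pages AS mp ON mp.page_id = pl.pl_from")
    for pl_from, pl_namespace, pl_title in cur:
        pass  # compare against the other group of articles here

If that works, the 300,000 per-article queries disappear entirely.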
Marc

2015-03-13 19:51 GMT+01:00 John <[email protected]>:

> Where are you getting your list of pages from?
>
> On Fri, Mar 13, 2015 at 2:46 PM, Marc Miquel <[email protected]> wrote:
>
>> Hi John,
>>
>> My queries obtain the "inlinks" and "outlinks" for the articles I have
>> in a group (x). Then I check (using Python) whether they have inlinks
>> and outlinks from another group of articles. So far I run one query
>> per article. I wanted to obtain all the links for group (x) at once
>> and then do this check in Python, but getting all the links for groups
>> as big as 300,000 articles would mean around 6 million links. Is it
>> possible to retrieve that much in one go, or is there a MySQL/RAM
>> limit?
>>
>> Thanks.
>>
>> Marc
>>
>> 2015-03-13 19:29 GMT+01:00 <[email protected]>:
>>
>>> Today's Topics:
>>>
>>>    1. dimension well my queries for very large tables like
>>>       pagelinks - Tool Labs (Marc Miquel)
>>>    2. Re: dimension well my queries for very large tables like
>>>       pagelinks - Tool Labs (John)
>>>    3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)
>>>    4. Re: Questions regarding the Labs Terms of use (Ryan Lane)
>>>    5. Re: Questions regarding the Labs Terms of use (Pine W)
>>>
>>> ----------------------------------------------------------------------
>>>
>>> Message: 1
>>> Date: Fri, 13 Mar 2015 17:59:09 +0100
>>> From: Marc Miquel <[email protected]>
>>> Subject: [Labs-l] dimension well my queries for very large tables
>>>    like pagelinks - Tool Labs
>>>
>>> Hello guys,
>>>
>>> I have a question regarding Tool Labs. I am doing research on links,
>>> and although I know very well what I am looking for, I struggle with
>>> how to get it efficiently...
>>>
>>> I would like your opinion, because you know the system well, and what
>>> is feasible and what is not.
>>>
>>> Here is what I need to do: I have a list of articles in several
>>> languages, and for each article I need to check its pagelinks to see
>>> where it points to and what points at it.
>>>
>>> Right now I run one query per article id in this list, which ranges
>>> from 80,000 articles in some Wikipedias to 300,000 and more in
>>> others. I have to do it several times, and it is very time consuming
>>> (several days). I wish counting the total number of links per article
>>> were enough, but I need to look at some of the links themselves.
>>>
>>> I was thinking about fetching all of pagelinks and iterating over it
>>> in Python (the language I use for all this). That would be much
>>> faster, because I would save the one query per article I am doing
>>> now. But the pagelinks table has millions of rows, and I cannot load
>>> all of that at once: MySQL would die. I could buffer the results, but
>>> I haven't tried whether that works.
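>>> Something like an unbuffered (server-side) cursor is what I have in
>>> mind; a minimal, untested sketch (assuming pymysql; the host and
>>> option-file names are placeholders):
>>>
>>>     import pymysql
>>>     import pymysql.cursors
>>>
>>>     # SSCursor streams rows from the server instead of buffering the
>>>     # whole result set in memory.
>>>     conn = pymysql.connect(read_default_file="~/.my.cnf",
>>>                            host="enwiki.labsdb",  # placeholder host
>>>                            db="enwiki_p",
>>>                            cursorclass=pymysql.cursors.SSCursor)
>>>     cur = conn.cursor()
>>>     cur.execute("SELECT pl_from, pl_namespace, pl_title "
>>>                 "FROM pagelinks")
>>>     while True:
>>>         rows = cur.fetchmany(10000)  # 10,000 rows at a time
>>>         if not rows:
>>>             break
>>>         for pl_from, pl_namespace, pl_title in rows:
>>>             pass  # filter against my article dictionary here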
>>>
>>> Alternatively, I am considering creating a personal table in the
>>> database with the titles and ids and INNER JOINing it with pagelinks
>>> to obtain only the links for these 300,000 articles. That way I would
>>> retrieve just 20% of the table instead of 100%. It could still be 8M
>>> rows sometimes (page_title or page_id, one of the two per row), or
>>> even more, loaded into Python dictionaries and lists. Would that be a
>>> problem...? I have no idea how much RAM this implies, nor how much I
>>> can use in Tool Labs.
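>>> I suppose I could measure a small sample and extrapolate; a rough,
>>> untested back-of-the-envelope sketch:
>>>
>>>     import sys
>>>
>>>     # Build 100,000 fake (page_id -> title) entries and extrapolate
>>>     # the memory cost to 8 million rows.
>>>     sample = {i: "Some_Page_Title_%d" % i for i in range(100000)}
>>>     total = sys.getsizeof(sample) + sum(
>>>         sys.getsizeof(k) + sys.getsizeof(v) for k, v in sample.items())
>>>     per_entry = total / float(len(sample))
>>>     print("~%.0f bytes per entry, ~%.1f GB for 8M rows"
>>>           % (per_entry, per_entry * 8e6 / 2 ** 30))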
>>>
>>> I am totally lost when I run into these problems of scale... I
>>> thought about writing to the IRC channel, but this seemed too long
>>> and too specific. If you can give me any hint, that would really
>>> help.
>>>
>>> Thank you very much!
>>>
>>> Cheers,
>>>
>>> Marc Miquel
>>>
>>> ------------------------------
>>>
>>> Message: 2
>>> Date: Fri, 13 Mar 2015 13:07:20 -0400
>>> From: John <[email protected]>
>>> Subject: Re: [Labs-l] dimension well my queries for very large tables
>>>    like pagelinks - Tool Labs
>>>
>>> What kind of queries are you doing? Odds are they can be optimized.
>>>
>>> ------------------------------
>>>
>>> Message: 3
>>> Date: Fri, 13 Mar 2015 17:36:00 +0000
>>> From: Tim Landscheidt <[email protected]>
>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>
>>> (anonymous) wrote:
>>>
>>> > [...]
>>>
>>> > To be clear: I'm not going to make my code proprietary in any way.
>>> > I just wanted to know whether I'm entitled to ask for the source of
>>> > every Labs bot ;-)
>>>
>>> Everyone is entitled to /ask/, but I don't think you have a right to
>>> /receive/ the source :-).
>>>
>>> AFAIK, there are two main reasons for the clause:
>>>
>>> a) WMF doesn't want to have to deal with individual licences that may
>>>    or may not have the potential for litigation ("The Software shall
>>>    be used for Good, not Evil"). By requiring OSI-approved, tried and
>>>    true licences, the risk is negligible.
>>>
>>> b) Bots and tools running on an infrastructure financed by donors,
>>>    like contributions to Wikipedia & Co., shouldn't be usable for
>>>    blackmail. No one should be in a legal position to demand
>>>    something "or else ...". The perpetuity of open source licences
>>>    guarantees that everyone can be truly thankful to developers
>>>    without having to fear that they will otherwise shut down devices,
>>>    delete content, etc.
>>>
>>> But the nice thing about collaboratively developed open source
>>> software is that it is usually of better quality, so clandestine code
>>> is often not that interesting.
>>>
>>> Tim
>>>
>>> ------------------------------
>>>
>>> Message: 4
>>> Date: Fri, 13 Mar 2015 11:52:18 -0600
>>> From: Ryan Lane <[email protected]>
>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>
>>> On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa
>>> <[email protected]> wrote:
>>>
>>> > From https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use
>>> > (verbatim): "Do not use or install any software unless the software
>>> > is licensed under an Open Source license".
>>> > What about tools and services made up of software themselves? Do
>>> > they have to be Open Source?
>>> > Strictly speaking, do the Terms of use require that all code be
>>> > made available to the public?
>>> > Thanks in advance.
>>>
>>> As the person who wrote the initial terms and included this, I can
>>> speak to the spirit of the term (I'm not a lawyer, so I won't try to
>>> go into any legal issues).
>>>
>>> I created Labs with the intent that it could be used as a mechanism
>>> to fork the projects as a whole, if necessary.
>>> A means to this end was including non-WMF employees in the process of
>>> infrastructure operations (which is outside the goals of the tools
>>> project in Labs). Tools/services that can't be distributed publicly
>>> harm that goal; tools/services that aren't open source completely
>>> break it. It's fine if you don't wish to maintain the code in a
>>> public git repo, but if another tool maintainer wishes to publish
>>> your code, there should be nothing blocking that.
>>>
>>> Depending on external closed-source services is a debatable topic. I
>>> know in the past we've decided to allow it. It goes against the
>>> spirit of the project, but it doesn't require us to distribute
>>> closed-source software in the case of a fork.
>>>
>>> My personal opinion is that your code should be in a public
>>> repository to encourage collaboration. As the terms are written,
>>> though, your code is required to be open source, and any libraries it
>>> depends on must be as well.
>>>
>>> - Ryan
>>>
>>> ------------------------------
>>>
>>> Message: 5
>>> Date: Fri, 13 Mar 2015 11:29:47 -0700
>>> From: Pine W <[email protected]>
>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>
>>> Question: are there heightened security or privacy risks posed by
>>> having non-open-source code running in Labs?
>>>
>>> Is anyone proactively auditing Labs software for open source
>>> compliance, and if not, should this be done?
>>>
>>> Pine
>>>
>>> ------------------------------
>>>
>>> End of Labs-l Digest, Vol 39, Issue 13
>>> **************************************
