Where are you getting the list of 300k pages from? I want to get a feel for the kinds of queries you're running so that we can optimize the process for you.
On Fri, Mar 13, 2015 at 2:53 PM, Marc Miquel <[email protected]> wrote:

> I load the files "page_titles" and "page_ids" and put them into a
> dictionary. One option I haven't used yet would be putting them into a
> database and INNER JOINing against the pagelinks table to obtain just the
> links for those articles. Still, with a list of 300,000 articles, even
> though that is only about 20% of the database, it is still a lot.
>
> Marc
>
> 2015-03-13 19:51 GMT+01:00 John <[email protected]>:
>
>> Where are you getting your list of pages from?
>>
>> On Fri, Mar 13, 2015 at 2:46 PM, Marc Miquel <[email protected]>
>> wrote:
>>
>>> Hi John,
>>>
>>> My queries obtain the inlinks and outlinks for the articles in a group
>>> (x). Then I check (in Python) whether they have inlinks and outlinks
>>> from another group of articles. Right now I run one query per article.
>>> I wanted to obtain all the links for group (x) at once and then do this
>>> check, but getting all the links for a group as big as 300,000 articles
>>> would mean around 6 million links. Is it possible to retrieve that
>>> much, or is there a MySQL/RAM limit?
>>>
>>> Thanks.
>>>
>>> Marc
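Below is a minimal sketch of the user-table INNER JOIN approach Marc describes above. It assumes pymysql, a Tool Labs replica credentials file (~/replica.my.cnf), and placeholder names for the host, user database, and table (enwiki.labsdb, s12345__links, my_pages); adjust all of these to your own setup.

# Sketch: load the 300,000 page ids into a user table once, then let
# MySQL do the filtering with an INNER JOIN instead of one query per
# article. Host, database, and table names are placeholders.
import os
import pymysql

conn = pymysql.connect(
    host='enwiki.labsdb',
    db='enwiki_p',
    read_default_file=os.path.expanduser('~/replica.my.cnf'))

with conn.cursor() as cur:
    # A one-column user table holding the article ids of interest.
    cur.execute("CREATE DATABASE IF NOT EXISTS s12345__links")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS s12345__links.my_pages (
            page_id INT UNSIGNED NOT NULL PRIMARY KEY
        )""")
    # Bulk-insert the ids in chunks so no single statement gets huge.
    ids = [int(line) for line in open('page_ids')]
    for i in range(0, len(ids), 10000):
        cur.executemany(
            "INSERT IGNORE INTO s12345__links.my_pages VALUES (%s)",
            [(x,) for x in ids[i:i + 10000]])
    conn.commit()

    # Outlinks for just those pages, fetched in a single query; the
    # join keeps the scan to roughly the 20% of pagelinks that matters.
    cur.execute("""
        SELECT pl.pl_from, pl.pl_namespace, pl.pl_title
        FROM pagelinks AS pl
        INNER JOIN s12345__links.my_pages AS mp
            ON mp.page_id = pl.pl_from""")
    for pl_from, ns, title in cur:
        pass  # check each link against the other group here

Inlinks work the same way, joining on pl_namespace/pl_title against page titles instead of pl_from. If the result is still too big to buffer client-side, the same query can be run through a server-side cursor (see the sketch further down).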
>>> 2015-03-13 19:29 GMT+01:00 <[email protected]>:
>>>
>>>> Today's Topics:
>>>>
>>>>    1. dimension well my queries for very large tables like
>>>>       pagelinks - Tool Labs (Marc Miquel)
>>>>    2. Re: dimension well my queries for very large tables like
>>>>       pagelinks - Tool Labs (John)
>>>>    3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)
>>>>    4. Re: Questions regarding the Labs Terms of use (Ryan Lane)
>>>>    5. Re: Questions regarding the Labs Terms of use (Pine W)
>>>>
>>>> ----------------------------------------------------------------------
>>>>
>>>> Message: 1
>>>> Date: Fri, 13 Mar 2015 17:59:09 +0100
>>>> From: Marc Miquel <[email protected]>
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: [Labs-l] dimension well my queries for very large tables like pagelinks - Tool Labs
>>>>
>>>> Hello guys,
>>>>
>>>> I have a question regarding Tool Labs. I am doing research on links,
>>>> and although I know very well what I am looking for, I struggle with
>>>> how to get it efficiently...
>>>>
>>>> I need your opinion because you know the system well, and what is
>>>> feasible and what is not.
>>>>
>>>> Let me explain what I need to do: I have a list of articles in
>>>> different languages, and for each article I need to check its
>>>> pagelinks to see where it points to and what points at it.
>>>>
>>>> Right now I run one query per article id in this list, which ranges
>>>> from 80,000 articles on some Wikipedias to 300,000 or more on others.
>>>> I have to do this several times, and it is very time-consuming
>>>> (several days). I wish I could just count the total number of links
>>>> in each case, but I need to look at some of the links per article.
>>>>
>>>> I was thinking about getting all pagelinks rows and iterating over
>>>> them in Python (the language I use for all of this). That would be
>>>> much faster because I would save all the per-article queries I am
>>>> doing now. But the pagelinks table has millions of rows, and I cannot
>>>> load all of that or MySQL would die. I could buffer, but I have not
>>>> yet tried whether that works.
>>>>
>>>> I am considering creating a personal table in the database with the
>>>> titles and ids, and inner joining to obtain the pagelinks for just
>>>> these 300,000 articles. That way I would retrieve only about 20% of
>>>> the table instead of 100%. It could still be 8M rows or more
>>>> (page_title or page_id, one of the two per row), loaded into Python
>>>> dictionaries and lists. Would that be a problem...? I have no idea
>>>> how much RAM this implies or how much I can use on Tool Labs.
>>>>
>>>> I am totally lost when I run into these problems of scale... I
>>>> thought about writing to the IRC channel, but this seemed too long
>>>> and too specific. Any hint would really help.
>>>>
>>>> Thank you very much!
>>>>
>>>> Cheers,
>>>>
>>>> Marc Miquel
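On the "MySQL would die" worry: the client does not have to buffer the whole result. As a rough estimate, 8 million rows kept as Python ints and strings in dictionaries costs on the order of a gigabyte, so holding everything in memory is borderline. A server-side (unbuffered) cursor avoids that by streaming rows in batches, which is the "buffering" Marc mentions. A sketch of that idea, again assuming pymysql and the same placeholder connection details; the group files ("page_ids", "group_y_titles") are hypothetical names for the two article groups.

# Sketch: stream pagelinks in batches with a server-side cursor so
# memory stays flat, keeping only the links that connect the two groups.
import os
import pymysql
import pymysql.cursors

conn = pymysql.connect(
    host='enwiki.labsdb',
    db='enwiki_p',
    read_default_file=os.path.expanduser('~/replica.my.cnf'),
    cursorclass=pymysql.cursors.SSCursor)

group_x = set(int(line) for line in open('page_ids'))           # source ids
group_y = set(line.strip() for line in open('group_y_titles'))  # target titles

links_x_to_y = []
with conn.cursor() as cur:
    cur.execute("SELECT pl_from, pl_namespace, pl_title FROM pagelinks")
    while True:
        rows = cur.fetchmany(50000)   # one modest batch at a time
        if not rows:
            break
        for pl_from, ns, title in rows:
            # pl_title is VARBINARY on the replicas, so it may arrive
            # as bytes; normalize before comparing.
            if isinstance(title, bytes):
                title = title.decode('utf-8')
            if pl_from in group_x and ns == 0 and title in group_y:
                links_x_to_y.append((pl_from, title))

Note that this still scans the entire table, so end-to-end it is slower than the join above; it just caps memory use.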
>>>> ------------------------------
>>>>
>>>> Message: 2
>>>> Date: Fri, 13 Mar 2015 13:07:20 -0400
>>>> From: John <[email protected]>
>>>> To: Wikimedia Labs <[email protected]>
>>>> Subject: Re: [Labs-l] dimension well my queries for very large tables like pagelinks - Tool Labs
>>>>
>>>> What kind of queries are you doing? Odds are they can be optimized.
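On "odds are they can be optimized": the usual first step, short of the user-table join, is to batch the per-article queries, a few hundred ids per IN (...) clause, so 300,000 round trips become about 600. A sketch under the same assumptions as the earlier examples (conn as created there):

# Sketch: one query per chunk of ids instead of one query per article.
CHUNK = 500
ids = [int(line) for line in open('page_ids')]
outlinks = {}

with conn.cursor() as cur:
    for i in range(0, len(ids), CHUNK):
        chunk = ids[i:i + CHUNK]
        placeholders = ','.join(['%s'] * len(chunk))
        cur.execute(
            "SELECT pl_from, pl_namespace, pl_title FROM pagelinks"
            " WHERE pl_from IN (" + placeholders + ")",
            chunk)
        for pl_from, ns, title in cur.fetchall():
            outlinks.setdefault(pl_from, []).append((ns, title))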
>>>> ------------------------------
>>>>
>>>> Message: 3
>>>> Date: Fri, 13 Mar 2015 17:36:00 +0000
>>>> From: Tim Landscheidt <[email protected]>
>>>> To: [email protected]
>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>
>>>> (anonymous) wrote:
>>>>
>>>> > [...]
>>>>
>>>> > To be clear: I'm not going to make my code proprietary in
>>>> > any way. I just wanted to know whether I'm entitled to ask
>>>> > for the source of every Labs bot ;-)
>>>>
>>>> Everyone is entitled to /ask/, but I don't think you have a
>>>> right to /receive/ the source :-).
>>>>
>>>> AFAIK, there are two main reasons for the clause:
>>>>
>>>> a) WMF doesn't want to have to deal with individual licences
>>>>    that may or may not have the potential for litigation
>>>>    ("The Software shall be used for Good, not Evil"). By
>>>>    requiring OSI-approved, tried-and-true licences, the risk
>>>>    is negligible.
>>>>
>>>> b) Bots and tools running on an infrastructure financed by
>>>>    donors, like contributions to Wikipedia & Co., shouldn't
>>>>    be usable for blackmail. No one should be in a legal
>>>>    position to demand something "or else ..." The perpetuity
>>>>    of OS licences guarantees that everyone can be truly
>>>>    thankful to developers without having to fear that they
>>>>    might otherwise shut down devices, delete content, etc.
>>>>
>>>> But the nice thing about collaboratively developed open
>>>> source software is that it is usually of better quality,
>>>> so clandestine code is often not that interesting.
>>>>
>>>> Tim
>>>>
>>>> ------------------------------
>>>>
>>>> Message: 4
>>>> Date: Fri, 13 Mar 2015 11:52:18 -0600
>>>> From: Ryan Lane <[email protected]>
>>>> To: Wikimedia Labs <[email protected]>
>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>
>>>> On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa <[email protected]> wrote:
>>>>
>>>> > From https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use
>>>> > (verbatim): "Do not use or install any software unless the software
>>>> > is licensed under an Open Source license".
>>>> > What about tools and services made up of software themselves? Do
>>>> > they have to be Open Source?
>>>> > Strictly speaking, do the Terms of use require that all code be made
>>>> > available to the public?
>>>> > Thanks in advance.
>>>>
>>>> As the person who wrote the initial terms and included this, I can
>>>> speak to the spirit of the term (I'm not a lawyer, so I won't try to
>>>> go into any legal issues).
>>>>
>>>> I created Labs with the intent that it could be used as a mechanism
>>>> to fork the projects as a whole, if necessary. A means to this end
>>>> was including non-WMF employees in the process of infrastructure
>>>> operations (which is outside the goals of the tools project in Labs).
>>>> Tools and services that can't be distributed publicly harm that goal;
>>>> tools and services that aren't open source completely break it. It's
>>>> fine if you wish not to maintain the code in a public git repo, but
>>>> if another tool maintainer wishes to publish your code, there should
>>>> be nothing blocking that.
>>>>
>>>> Depending on external closed-source services is a debatable topic. I
>>>> know in the past we've decided to allow it. It goes against the
>>>> spirit of the project, but it doesn't require us to distribute
>>>> closed-source software in the case of a fork.
>>>>
>>>> My personal opinion is that your code should be in a public
>>>> repository to encourage collaboration. As the terms are written,
>>>> though, your code is required to be open source, and any libraries it
>>>> depends on must be as well.
>>>>
>>>> - Ryan
>>>> ------------------------------
>>>>
>>>> Message: 5
>>>> Date: Fri, 13 Mar 2015 11:29:47 -0700
>>>> From: Pine W <[email protected]>
>>>> To: Wikimedia Labs <[email protected]>
>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>
>>>> Question: are there heightened security or privacy risks posed by
>>>> having non-open-source code running in Labs?
>>>>
>>>> Is anyone proactively auditing Labs software for open source
>>>> compliance, and if not, should this be done?
>>>>
>>>> Pine
>>>>
>>>> ------------------------------
>>>>
>>>> End of Labs-l Digest, Vol 39, Issue 13
>>>> **************************************
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
