Where are you getting your list of pages from?

On Fri, Mar 13, 2015 at 2:46 PM, Marc Miquel <[email protected]> wrote:
> Hi John,
>
> My queries are to obtain "inlinks" and "outlinks" for some articles I
> have in a group (x). Then I check (using Python) whether they have
> inlinks and outlinks from another group of articles. At the moment I
> run one query per article. I wanted to obtain all links for group (x)
> and then do this comparison, but getting all links for groups as big
> as 300,000 articles would mean around 6 million links. Is it possible
> to obtain all of this, or is there a MySQL/RAM limit?
>
> Thanks.
>
> Marc
>
> 2015-03-13 19:29 GMT+01:00 <[email protected]>:
>
>> Send Labs-l mailing list submissions to
>> [email protected]
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>> or, via email, send a message with subject or body 'help' to
>> [email protected]
>>
>> You can reach the person managing the list at
>> [email protected]
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of Labs-l digest..."
>>
>> Today's Topics:
>>
>>    1. dimension well my queries for very large tables like
>>       pagelinks - Tool Labs (Marc Miquel)
>>    2. Re: dimension well my queries for very large tables like
>>       pagelinks - Tool Labs (John)
>>    3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)
>>    4. Re: Questions regarding the Labs Terms of use (Ryan Lane)
>>    5. Re: Questions regarding the Labs Terms of use (Pine W)
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Fri, 13 Mar 2015 17:59:09 +0100
>> From: Marc Miquel <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Subject: [Labs-l] dimension well my queries for very large tables like
>>     pagelinks - Tool Labs
>>
>> Hello guys,
>>
>> I have a question regarding Tool Labs.
>> I am doing research on links, and although I know very well what I am
>> looking for, I struggle with how to get it efficiently.
>>
>> I would like your opinion, because you know the system well and know
>> what is feasible and what is not.
>>
>> Let me explain what I need to do:
>> I have a list of articles in several languages, and for each article
>> I need to check its pagelinks to see where it points to and what
>> points at it.
>>
>> Right now I run one query per article ID in this list, which ranges
>> from 80,000 articles in some Wikipedias to 300,000 or more in others.
>> I have to do this several times, and it is very time-consuming
>> (several days). I wish I could just count the total number of links
>> in each case, but I need to see some of the links per article.
>>
>> I was thinking about fetching all pagelinks and iterating in Python
>> (the language I use for all of this). That would be much faster,
>> because I would save all the per-article queries I am doing now. But
>> the pagelinks table has millions of rows, and I cannot load all of
>> that, because MySQL would die. I could buffer the results, but I
>> haven't tried whether that works either.
>>
>> I am considering creating a personal table in the database with
>> titles and IDs, and inner joining against it to obtain the pagelinks
>> for just these 300,000 articles. That way I would retrieve only about
>> 20% of the table instead of 100%. That could still be 8M rows or more
>> (one page_title or page_id per row), loaded into Python dictionaries
>> and lists. Would that be a problem? I have no idea how much RAM this
>> implies or how much I can use in Tool Labs.
>>
>> I am totally lost when I run into these problems of scale. I thought
>> about writing to the IRC channel, but this seemed too long and too
>> specific. Any hint would really help.
>>
>> Thank you very much!
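The batched approach Marc describes (one query per chunk of article IDs instead of one query per article, then plain set lookups in Python) might be sketched roughly like this. The pagelinks column names follow the MediaWiki schema (pl_from, pl_namespace, pl_title); the helper names and batch size are made up for the example, and the database connection is assumed:

```python
# Sketch only: batch the per-article lookups into a few large queries,
# then do the group-membership check in Python with set lookups.

def chunked(ids, size=500):
    """Split the article-ID list into batches so no single query
    has to carry 300,000 IDs at once."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def build_outlink_query(batch):
    """Build one parameterized query per batch, replacing the
    one-query-per-article loop."""
    placeholders = ", ".join(["%s"] * len(batch))
    return (
        "SELECT pl_from, pl_namespace, pl_title "
        "FROM pagelinks "
        f"WHERE pl_from IN ({placeholders})"
    )

def crosslinks(outlinks_by_page, group_y_titles):
    """For each group-x page, keep only the outlinks that land on a
    page in group y (a set, so each lookup is O(1))."""
    return {
        page: [t for t in titles if t in group_y_titles]
        for page, titles in outlinks_by_page.items()
    }
```

Each batch would then be run with something like `cursor.execute(build_outlink_query(batch), batch)` against the replica, accumulating rows into a dict keyed by pl_from; memory then scales with the result set (the ~8M rows Marc estimates), not with the whole pagelinks table.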
>>
>> Cheers,
>>
>> Marc Miquel
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Fri, 13 Mar 2015 13:07:20 -0400
>> From: John <[email protected]>
>> To: Wikimedia Labs <[email protected]>
>> Subject: Re: [Labs-l] dimension well my queries for very large tables
>>     like pagelinks - Tool Labs
>>
>> What kind of queries are you doing? Odds are they can be optimized.
>>
>> On Fri, Mar 13, 2015 at 12:59 PM, Marc Miquel <[email protected]> wrote:
>>
>> > [Marc's message, quoted in full above; trimmed]
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Fri, 13 Mar 2015 17:36:00 +0000
>> From: Tim Landscheidt <[email protected]>
>> To: [email protected]
>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>
>> (anonymous) wrote:
>>
>> > [...]
>>
>> > To be clear: I'm not going to make my code proprietary in
>> > any way. I just wanted to know whether I'm entitled to ask
>> > for the source of every Labs bot ;-)
>>
>> Everyone is entitled to /ask/, but I don't think you have a
>> right to /receive/ the source :-).
>>
>> AFAIK, there are two main reasons for the clause:
>>
>> a) WMF doesn't want to have to deal with individual licences
>>    that may or may not have the potential for litigation
>>    ("The Software shall be used for Good, not Evil"). By
>>    requiring OSI-approved, tried-and-true licences, the risk
>>    is negligible.
>>
>> b) Bots and tools running on an infrastructure financed by
>>    donors, like contributions to Wikipedia & Co., shouldn't
>>    be usable for blackmail. No one should be in a legal
>>    position to demand something "or else ...". The perpetuity
>>    of OS licences guarantees that everyone can be truly
>>    thankful to developers without having to fear that they
>>    might otherwise shut down devices, delete content, etc.
>>
>> But the nice thing about collaboratively developed open
>> source software is that it is usually of better quality,
>> so clandestine code is often not that interesting.
>>
>> Tim
>>
>> ------------------------------
>>
>> Message: 4
>> Date: Fri, 13 Mar 2015 11:52:18 -0600
>> From: Ryan Lane <[email protected]>
>> To: Wikimedia Labs <[email protected]>
>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>
>> On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa
>> <[email protected]> wrote:
>>
>> > From https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use
>> > (verbatim): "Do not use or install any software unless the software is
>> > licensed under an Open Source license".
>> > What about tools and services made up of software themselves? Do they
>> > have to be Open Source?
>> > Strictly speaking, do the Terms of use require that all code be made
>> > available to the public?
>> > Thanks in advance.
>>
>> As the person who wrote the initial terms and included this, I can
>> speak to the spirit of the term (I'm not a lawyer, so I won't try to
>> go into any legal issues).
>>
>> I created Labs with the intent that it could be used as a mechanism
>> to fork the projects as a whole, if necessary. A means to this end
>> was including non-WMF employees in the process of infrastructure
>> operations (which is outside the goals of the tools project in Labs).
>> Tools/services that can't be distributed publicly harm that goal.
>> Tools/services that aren't open source completely break that goal.
>> It's fine if you wish not to maintain the code in a public git repo,
>> but if another tool maintainer wishes to publish your code, there
>> should be nothing blocking that.
>>
>> Depending on external closed-source services is a debatable topic. I
>> know in the past we've decided to allow it. It goes against the
>> spirit of the project, but it doesn't require us to distribute
>> closed-source software in the case of a fork.
>>
>> My personal opinion is that your code should be in a public
>> repository to encourage collaboration. As the terms are written,
>> though, your code is required to be open source, and any libraries it
>> depends on must be as well.
>>
>> - Ryan
>>
>> ------------------------------
>>
>> Message: 5
>> Date: Fri, 13 Mar 2015 11:29:47 -0700
>> From: Pine W <[email protected]>
>> To: Wikimedia Labs <[email protected]>
>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>
>> Question: are there heightened security or privacy risks posed by
>> having non-open-source code running in Labs?
>>
>> Is anyone proactively auditing Labs software for open source
>> compliance, and if not, should this be done?
>>
>> Pine
>>
>> On Mar 13, 2015 10:52 AM, "Ryan Lane" <[email protected]> wrote:
>>
>> > [Ryan's message, quoted in full above; trimmed]
>>
>> ------------------------------
>>
>> _______________________________________________
>> Labs-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>>
>> End of Labs-l Digest, Vol 39, Issue 13
>> **************************************
