Hi John,
My queries obtain the "inlinks" and "outlinks" for the articles I have in a group (x). Then I check (using Python) whether they have inlinks and outlinks from another group of articles. So far I have been running one query per article. I wanted to obtain all the links for group (x) at once and then do this check... but getting all the links for groups as large as 300,000 articles would mean around 6 million links. Is it possible to retrieve all of that, or is there a MySQL/RAM limit?

Thanks.

Marc

2015-03-13 19:29 GMT+01:00 <[email protected]>:

> Today's Topics:
>
>    1. dimension well my queries for very large tables like
>       pagelinks - Tool Labs (Marc Miquel)
>    2. Re: dimension well my queries for very large tables like
>       pagelinks - Tool Labs (John)
>    3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)
>    4. Re: Questions regarding the Labs Terms of use (Ryan Lane)
>    5. Re: Questions regarding the Labs Terms of use (Pine W)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 13 Mar 2015 17:59:09 +0100
> From: Marc Miquel <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: [Labs-l] dimension well my queries for very large tables like
>         pagelinks - Tool Labs
>
> Hello guys,
>
> I have a question regarding Tool Labs. I am doing research on links, and
> although I know very well what I am looking for, I struggle with how to
> get it efficiently...
>
> I need your opinion because you know the system well, and what is
> feasible and what is not.
>
> Let me explain what I need to do: I have a list of articles in several
> languages, and for each article I need to check its pagelinks to see
> where it points to and where it is pointed at from.
>
> Right now I run one query per article ID in this list, which ranges from
> 80,000 articles in some Wikipedias to 300,000 or more in others. I have
> to do it several times, and it is very time consuming (several days). I
> wish I could just count the total number of links in each case, but I
> also need to see some of the links per article.
>
> I was thinking about fetching all pagelinks and iterating in Python (the
> language I use for all of this). That would be much faster, because I
> would save all the per-article queries I am doing now. But the pagelinks
> table has millions of rows, and I cannot load all of that because MySQL
> would die. I could buffer, but I have not tried whether that works
> either.
>
> I am considering creating a personal table in the database with titles
> and IDs, and inner joining it to obtain only the pagelinks for these
> 300,000 articles. That way I would retrieve just 20% of the table
> instead of 100%. That could still be 8M rows or more (page_title or
> page_id, one of the two per row), loaded into Python dictionaries and
> lists. Would that be a problem...? I have no idea how much RAM this
> implies or how much I can use in Tool Labs.
>
> I am totally lost when I run into these problems of scale... I thought
> about writing to the IRC channel, but this seemed too long and too
> specific. Any hint would really help.
>
> Thank you very much!
>
> Cheers,
>
> Marc Miquel
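One middle ground between one query per article and pulling the whole pagelinks table is to batch the page IDs into IN() queries. A minimal sketch, assuming `conn` is any MySQL DB-API connection (for example from `pymysql.connect(...)` on the replica) and the standard MediaWiki pagelinks schema (`pl_from`, `pl_title`); the function and chunk size are illustrative, not an existing tool:

```python
# Sketch: batch the per-article link lookups instead of querying once per page.
# Assumptions: `conn` is a DB-API connection (e.g. pymysql), and the replica
# exposes the MediaWiki pagelinks table with pl_from and pl_title columns.
import itertools

def chunked(seq, size):
    """Yield successive lists of at most `size` items from `seq`."""
    it = iter(seq)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def outlinks_for_group(conn, page_ids, chunk_size=5000):
    """Map each page ID in the group to the set of titles it links to.

    One query per 5,000 articles instead of one per article: roughly 60
    queries for a 300,000-article group rather than 300,000.
    """
    links = {}
    with conn.cursor() as cur:
        for chunk in chunked(page_ids, chunk_size):
            placeholders = ",".join(["%s"] * len(chunk))
            cur.execute(
                "SELECT pl_from, pl_title FROM pagelinks"
                " WHERE pl_from IN (%s)" % placeholders,
                chunk,
            )
            for pl_from, pl_title in cur.fetchall():
                links.setdefault(pl_from, set()).add(pl_title)
    return links
```

On the RAM question: as a rough estimate, a short Python string costs on the order of 50-100 bytes in CPython, so 6-8M link rows held in dictionaries of sets would be somewhere around 0.5-1 GB, plus container overhead. Streaming results chunk by chunk (or using an unbuffered server-side cursor) and aggregating as you go keeps the peak well below loading everything at once.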
>
> ------------------------------
>
> Message: 2
> Date: Fri, 13 Mar 2015 13:07:20 -0400
> From: John <[email protected]>
> To: Wikimedia Labs <[email protected]>
> Subject: Re: [Labs-l] dimension well my queries for very large tables
>         like pagelinks - Tool Labs
>
> What kind of queries are you doing? Odds are they can be optimized.
>
> On Fri, Mar 13, 2015 at 12:59 PM, Marc Miquel <[email protected]>
> wrote:
>
> > [...]
>
> ------------------------------
>
> Message: 3
> Date: Fri, 13 Mar 2015 17:36:00 +0000
> From: Tim Landscheidt <[email protected]>
> To: [email protected]
> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>
> (anonymous) wrote:
>
> > [...]
>
> > To be clear: I'm not going to make my code proprietary in
> > any way. I just wanted to know whether I'm entitled to ask
> > for the source of every Labs bot ;-)
>
> Everyone is entitled to /ask/, but I don't think you have a
> right to /receive/ the source :-).
>
> AFAIK, there are two main reasons for the clause:
>
> a) WMF doesn't want to have to deal with individual licences
>    that may or may not have the potential for litigation
>    ("The Software shall be used for Good, not Evil").
>    By requiring OSI-approved, tried and true licences, the
>    risk is negligible.
>
> b) Bots and tools running on an infrastructure financed by
>    donors, like contributions to Wikipedia & Co., shouldn't
>    be usable for blackmail. No one should be in a legal
>    position to demand something "or else ..." The perpetuity
>    of OS licences guarantees that everyone can be truly
>    thankful to developers without having to fear that
>    otherwise they shut down devices, delete content, etc.
>
> But the nice thing about collaboratively developed open
> source software is that it usually is of better quality,
> so clandestine code is often not that interesting.
>
> Tim
>
> ------------------------------
>
> Message: 4
> Date: Fri, 13 Mar 2015 11:52:18 -0600
> From: Ryan Lane <[email protected]>
> To: Wikimedia Labs <[email protected]>
> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>
> On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa <
> [email protected]> wrote:
>
> > From https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use
> > (verbatim): "Do not use or install any software unless the software is
> > licensed under an Open Source license".
> > What about tools and services made up of software themselves? Do they
> > have to be Open Source?
> > Strictly speaking, do the Terms of use require that all code be made
> > available to the public?
> > Thanks in advance.
>
> As the person who wrote the initial terms and included this, I can speak
> to the spirit of the term (I'm not a lawyer, so I won't try to go into
> any legal issues).
>
> I created Labs with the intent that it could be used as a mechanism to
> fork the projects as a whole, if necessary.
> A means to this end was including non-WMF employees in the process of
> infrastructure operations (which is outside the goals of the tools
> project in Labs). Tools/services that can't be distributed publicly harm
> that goal. Tools/services that aren't open source completely break that
> goal. It's fine if you wish not to maintain the code in a public git
> repo, but if another tool maintainer wishes to publish your code, there
> should be nothing blocking that.
>
> Depending on external closed-source services is a debatable topic. I
> know in the past we've decided to allow it. It goes against the spirit
> of the project, but it doesn't require us to distribute closed-source
> software in the case of a fork.
>
> My personal opinion is that your code should be in a public repository
> to encourage collaboration. As the terms are written, though, your code
> is required to be open source, and any libraries it depends on must be
> as well.
>
> - Ryan
>
> ------------------------------
>
> Message: 5
> Date: Fri, 13 Mar 2015 11:29:47 -0700
> From: Pine W <[email protected]>
> To: Wikimedia Labs <[email protected]>
> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>
> Question: are there heightened security or privacy risks posed by having
> non-open-source code running in Labs?
>
> Is anyone proactively auditing Labs software for open source compliance,
> and if not, should this be done?
>
> Pine
>
> On Mar 13, 2015 10:52 AM, "Ryan Lane" <[email protected]> wrote:
>
> > [...]
>
> ------------------------------
>
> End of Labs-l Digest, Vol 39, Issue 13
> **************************************
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
