OK, no offense, but you're not being helpful. Are you pulling from a category, using page text, or using some format already in the database? The reason I ask is that, depending on how you are selecting the articles, it may be possible to query the database in a manner that optimizes the process and makes the overall run time drop drastically. There have been a few queries I have worked on in the past that originally took hours or even days, and we were able to get them down to a few minutes. However, the vague information given so far isn't enough to go on.
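For example, if you already have the page_ids in a flat file, simply batching them into IN () lists takes you from one query per article down to a few dozen queries per wiki. Rough, untested sketch in Python; the replica host/db names are placeholders and the pagelinks column names are from memory of the schema, so double-check them before relying on this:

import os
import MySQLdb   # driver available on Tool Labs; pymysql works the same way

BATCH = 5000     # 300,000 ids / 5,000 = 60 round trips instead of 300,000

def outlinks(page_ids, cur):
    """Fetch (source_id, target_ns, target_title) rows for all ids, in batches."""
    rows = []
    for i in range(0, len(page_ids), BATCH):
        batch = page_ids[i:i + BATCH]
        placeholders = ",".join(["%s"] * len(batch))
        cur.execute("SELECT pl_from, pl_namespace, pl_title FROM pagelinks "
                    "WHERE pl_from IN (%s)" % placeholders, batch)
        rows.extend(cur.fetchall())
    return rows

conn = MySQLdb.connect(
    host="cawiki.labsdb",        # placeholder: replica server for your wiki
    db="cawiki_p",               # placeholder: the _p replica database
    read_default_file=os.path.expanduser("~/replica.my.cnf"))
cur = conn.cursor()

page_ids = [int(line) for line in open("page_ids") if line.strip()]
links = outlinks(page_ids, cur)

# The "links from another group" check can then stay in Python: build a set
# of that group's (namespace, title) pairs once and test each target against it.

The staging-table variant you describe further down the thread is sketched below, after the quoted digest.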
On Fri, Mar 13, 2015 at 3:01 PM, Marc Miquel <[email protected]> wrote:

> I get them according to a selection I make based on other parameters more
> related to the content. The selection of these 300,000 (which could be
> 30,000 or even 500,000 in other cases, like the German wiki) is not an
> issue. The link analysis to see whether these 300,000 receive links from
> another group of articles is my concern...
>
> Marc
>
> 2015-03-13 19:56 GMT+01:00 John <[email protected]>:
>
>> Where are you getting the list of 300k pages from? I want to get a feel
>> for the kinds of queries you're running so that we can optimize the
>> process for you.
>>
>> On Fri, Mar 13, 2015 at 2:53 PM, Marc Miquel <[email protected]> wrote:
>>
>>> I load "page_titles" and "page_ids" from a file and put them in a
>>> dictionary. One option I haven't used would be putting them into a
>>> database and INNER JOINing with the pagelinks table to obtain just the
>>> links for those articles. Still, if the list is 300,000, even though
>>> that is just 20% of the database, it is still a lot.
>>>
>>> Marc
>>>
>>> 2015-03-13 19:51 GMT+01:00 John <[email protected]>:
>>>
>>>> Where are you getting your list of pages from?
>>>>
>>>> On Fri, Mar 13, 2015 at 2:46 PM, Marc Miquel <[email protected]> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> My queries obtain the inlinks and outlinks for the articles I have in
>>>>> a group (x). Then I check (using Python) whether they have inlinks
>>>>> and outlinks from another group of articles. Right now I am doing a
>>>>> query for each article. I wanted to obtain all links for group (x)
>>>>> and then do this check... But getting all links for groups as big as
>>>>> 300,000 articles would mean 6 million links. Is it possible to obtain
>>>>> all of this, or is there a MySQL/RAM limit?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Marc
>>>>>
>>>>> 2015-03-13 19:29 GMT+01:00 <[email protected]>:
>>>>>
>>>>>> Today's Topics:
>>>>>>
>>>>>>    1. dimension well my queries for very large tables like
>>>>>>       pagelinks - Tool Labs (Marc Miquel)
>>>>>>    2. Re: dimension well my queries for very large tables like
>>>>>>       pagelinks - Tool Labs (John)
>>>>>>    3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)
>>>>>>    4. Re: Questions regarding the Labs Terms of use (Ryan Lane)
>>>>>>    5. Re: Questions regarding the Labs Terms of use (Pine W)
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> Message: 1
>>>>>> Date: Fri, 13 Mar 2015 17:59:09 +0100
>>>>>> From: Marc Miquel <[email protected]>
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Subject: [Labs-l] dimension well my queries for very large tables like
>>>>>>         pagelinks - Tool Labs
>>>>>>
>>>>>> Hello guys,
>>>>>>
>>>>>> I have a question regarding Tool Labs. I am doing research on links,
>>>>>> and although I know very well what I am looking for, I struggle with
>>>>>> how to get it efficiently...
>>>>>>
>>>>>> I need your opinion because you know the system very well, and what
>>>>>> is feasible and what is not.
>>>>>>
>>>>>> Let me explain what I need to do:
>>>>>> I have a list of articles for different languages, and I need to
>>>>>> check their pagelinks to see where they link to and what links to
>>>>>> them.
>>>>>>
>>>>>> Right now I run a query for each article id in this list of articles,
>>>>>> which ranges from 80,000 in some Wikipedias to 300,000 in others, and
>>>>>> more. I have to do it several times and it is very time consuming
>>>>>> (several days). I wish I could just count the total number of links
>>>>>> for each case, but I need to see only some of the links per article.
>>>>>>
>>>>>> I was thinking about getting all pagelinks and iterating in Python
>>>>>> (the language I use for all of this). This would be much faster
>>>>>> because I'd save all the queries, one per article, that I am doing
>>>>>> now. But the pagelinks table has millions of rows and I cannot load
>>>>>> all of that, because MySQL would die. I could buffer, but I haven't
>>>>>> tested whether that works either.
>>>>>>
>>>>>> I am considering creating a personal table in the database with
>>>>>> titles and ids, and inner joining it to obtain only the pagelinks for
>>>>>> these 300,000 articles. With this I would retrieve just 20% of the
>>>>>> database instead of 100%. That would sometimes be maybe 8M rows
>>>>>> (page_title or page_id, one of the two per row), or even more...
>>>>>> loaded into Python dictionaries and lists. Would that be a problem?
>>>>>> I have no idea how much RAM this implies and how much I can use in
>>>>>> Tool Labs.
>>>>>>
>>>>>> I am totally lost when I hit these problems related to scale... I
>>>>>> thought about writing to the IRC channel, but I thought it was maybe
>>>>>> too long and too specific. If you can give me any hint, that would
>>>>>> really help.
>>>>>>
>>>>>> Thank you very much!
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Marc Miquel
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> Message: 2
>>>>>> Date: Fri, 13 Mar 2015 13:07:20 -0400
>>>>>> From: John <[email protected]>
>>>>>> To: Wikimedia Labs <[email protected]>
>>>>>> Subject: Re: [Labs-l] dimension well my queries for very large tables
>>>>>>         like pagelinks - Tool Labs
>>>>>>
>>>>>> What kind of queries are you doing? Odds are they can be optimized.
>>>>>>
>>>>>> On Fri, Mar 13, 2015 at 12:59 PM, Marc Miquel <[email protected]> wrote:
>>>>>>
>>>>>> > [...]
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> Message: 3
>>>>>> Date: Fri, 13 Mar 2015 17:36:00 +0000
>>>>>> From: Tim Landscheidt <[email protected]>
>>>>>> To: [email protected]
>>>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>>>
>>>>>> (anonymous) wrote:
>>>>>>
>>>>>> > [...]
>>>>>>
>>>>>> > To be clear: I'm not going to make my code proprietary in
>>>>>> > any way. I just wanted to know whether I'm entitled to ask
>>>>>> > for the source of every Labs bot ;-)
>>>>>>
>>>>>> Everyone is entitled to /ask/, but I don't think you have a
>>>>>> right to /receive/ the source :-).
>>>>>>
>>>>>> AFAIK, there are two main reasons for the clause:
>>>>>>
>>>>>> a) WMF doesn't want to have to deal with individual licences
>>>>>>    that may or may not have the potential for litigation
>>>>>>    ("The Software shall be used for Good, not Evil"). By
>>>>>>    requiring OSI-approved, tried and true licences, the risk
>>>>>>    is negligible.
>>>>>>
>>>>>> b) Bots and tools running on an infrastructure financed by
>>>>>>    donors, like contributions to Wikipedia & Co., shouldn't
>>>>>>    be usable for blackmail. No one should be in a legal
>>>>>>    position to demand something "or else ...". The perpetuity
>>>>>>    of OS licences guarantees that everyone can be truly
>>>>>>    thankful to developers without having to fear that
>>>>>>    otherwise they shut down devices, delete content, etc.
>>>>>>
>>>>>> But the nice thing about collaboratively developed open
>>>>>> source software is that it is usually of better quality,
>>>>>> so clandestine code is often not that interesting.
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> Message: 4
>>>>>> Date: Fri, 13 Mar 2015 11:52:18 -0600
>>>>>> From: Ryan Lane <[email protected]>
>>>>>> To: Wikimedia Labs <[email protected]>
>>>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>>>
>>>>>> On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> > From https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use
>>>>>> > (verbatim): "Do not use or install any software unless the software
>>>>>> > is licensed under an Open Source license".
>>>>>> > What about tools and services made up of software themselves? Do
>>>>>> > they have to be Open Source?
>>>>>> > Strictly speaking, do the Terms of use require that all code be
>>>>>> > made available to the public?
>>>>>> > Thanks in advance.
>>>>>>
>>>>>> As the person who wrote the initial terms and included this, I can
>>>>>> speak to the spirit of the term (I'm not a lawyer, so I won't try to
>>>>>> go into any legal issues).
>>>>>>
>>>>>> I created Labs with the intent that it could be used as a mechanism
>>>>>> to fork the projects as a whole, if necessary. A means to this end
>>>>>> was including non-WMF employees in the process of infrastructure
>>>>>> operations (which is outside the goals of the tools project in Labs).
>>>>>> Tools/services that can't be distributed publicly harm that goal.
>>>>>> Tools/services that aren't open source completely break that goal.
>>>>>> It's fine if you wish not to maintain the code in a public git repo,
>>>>>> but if another tool maintainer wishes to publish your code, there
>>>>>> should be nothing blocking that.
>>>>>>
>>>>>> Depending on external closed-source services is a debatable topic. I
>>>>>> know in the past we've decided to allow it. It goes against the
>>>>>> spirit of the project, but it doesn't require us to distribute
>>>>>> closed-source software in the case of a fork.
>>>>>>
>>>>>> My personal opinion is that your code should be in a public
>>>>>> repository to encourage collaboration. As the terms are written,
>>>>>> though, your code is required to be open source, and any libraries
>>>>>> it depends on must be as well.
>>>>>>
>>>>>> - Ryan
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> Message: 5
>>>>>> Date: Fri, 13 Mar 2015 11:29:47 -0700
>>>>>> From: Pine W <[email protected]>
>>>>>> To: Wikimedia Labs <[email protected]>
>>>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>>>
>>>>>> Question: are there heightened security or privacy risks posed by
>>>>>> having non-open-source code running in Labs?
>>>>>>
>>>>>> Is anyone proactively auditing Labs software for open source
>>>>>> compliance, and if not, should this be done?
>>>>>>
>>>>>> Pine
>>>>>>
>>>>>> On Mar 13, 2015 10:52 AM, "Ryan Lane" <[email protected]> wrote:
>>>>>>
>>>>>> > [...]
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> End of Labs-l Digest, Vol 39, Issue 13
>>>>>> **************************************
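
PS: the personal-table idea you mention further up is also workable, and it pushes the filtering into MySQL instead of Python. Another untested sketch, following the same assumptions as the one above; "u1234__marc" and "my_pages" are made-up names, and it assumes your user database sits on the same labsdb server as the replica so the cross-database join is allowed:

import os
import MySQLdb

conn = MySQLdb.connect(
    host="cawiki.labsdb",        # placeholder replica host
    read_default_file=os.path.expanduser("~/replica.my.cnf"))
cur = conn.cursor()

# A one-column staging table holding the selected articles.
cur.execute("CREATE TABLE IF NOT EXISTS u1234__marc.my_pages "
            "(page_id INT UNSIGNED NOT NULL PRIMARY KEY)")

page_ids = [(int(line),) for line in open("page_ids") if line.strip()]
for i in range(0, len(page_ids), 10000):     # load the 300k ids in chunks
    cur.executemany("INSERT IGNORE INTO u1234__marc.my_pages (page_id) "
                    "VALUES (%s)", page_ids[i:i + 10000])
conn.commit()

# One join instead of 300,000 queries: links whose source is in the set and
# whose target resolves to a page that is also in the set.
cur.execute("""
    SELECT pl.pl_from, p.page_id
    FROM cawiki_p.pagelinks AS pl
    JOIN cawiki_p.page AS p
         ON p.page_namespace = pl.pl_namespace
        AND p.page_title     = pl.pl_title
    JOIN u1234__marc.my_pages AS src ON src.page_id = pl.pl_from
    JOIN u1234__marc.my_pages AS dst ON dst.page_id = p.page_id
""")
links = cur.fetchall()   # (source_id, target_id) pairs, only for your articles

For the "links from another group" case, load the second group into a second staging table and join it on pl_from instead of my_pages, so the sources come from that group and the targets from yours.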
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
