OK, no offense, but you're not being helpful. Are you pulling from a category, using page text, or using some format already in the database? The reason I ask is that, depending on how you are selecting the articles, it may be possible to query the database in a manner that optimizes the process and makes the overall run time drop drastically. There have been a few queries I have worked on in the past that originally took hours or even days, and we were able to get them down to a few minutes. However, the vague information given so far isn't enough to go on.
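For example, if you already have the page_ids in a flat file, simply batching them into IN () lists takes you from one query per article down to a few dozen queries per wiki. Rough, untested sketch in Python; the replica host/db names are placeholders and the pagelinks column names are from memory of the schema, so double-check them before relying on this:

import os
import MySQLdb   # driver available on Tool Labs; pymysql works the same way

BATCH = 5000     # 300,000 ids / 5,000 = 60 round trips instead of 300,000

def outlinks(page_ids, cur):
    """Fetch (source_id, target_ns, target_title) rows for all ids, in batches."""
    rows = []
    for i in range(0, len(page_ids), BATCH):
        batch = page_ids[i:i + BATCH]
        placeholders = ",".join(["%s"] * len(batch))
        cur.execute("SELECT pl_from, pl_namespace, pl_title FROM pagelinks "
                    "WHERE pl_from IN (%s)" % placeholders, batch)
        rows.extend(cur.fetchall())
    return rows

conn = MySQLdb.connect(
    host="cawiki.labsdb",        # placeholder: replica server for your wiki
    db="cawiki_p",               # placeholder: the _p replica database
    read_default_file=os.path.expanduser("~/replica.my.cnf"))
cur = conn.cursor()

page_ids = [int(line) for line in open("page_ids") if line.strip()]
links = outlinks(page_ids, cur)

# The "links from another group" check can then stay in Python: build a set
# of that group's (namespace, title) pairs once and test each target against it.

The staging-table variant you describe further down the thread is sketched below, after the quoted digest.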
On Fri, Mar 13, 2015 at 3:01 PM, Marc Miquel <[email protected]> wrote:

> I get them according to a selection I make based on other parameters more
> related to the content. The selection of these 300,000 (which could be
> 30,000 or even 500,000 in other cases, like the German wiki) is not an
> issue. The link analysis to see whether these 300,000 receive links from
> another group of articles is my concern...
>
> Marc
>
> 2015-03-13 19:56 GMT+01:00 John <[email protected]>:
>
>> Where are you getting the list of 300k pages from? I want to get a feel
>> for the kinds of queries you're running so that we can optimize the
>> process for you.
>>
>> On Fri, Mar 13, 2015 at 2:53 PM, Marc Miquel <[email protected]> wrote:
>>
>>> I load "page_titles" and "page_ids" from a file and put them in a
>>> dictionary. One option I haven't used would be putting them into a
>>> database and INNER JOINing with the pagelinks table to obtain just the
>>> links for those articles. Still, if the list is 300,000, even though
>>> that is just 20% of the database, it is still a lot.
>>>
>>> Marc
>>>
>>> 2015-03-13 19:51 GMT+01:00 John <[email protected]>:
>>>
>>>> Where are you getting your list of pages from?
>>>>
>>>> On Fri, Mar 13, 2015 at 2:46 PM, Marc Miquel <[email protected]> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> My queries obtain the inlinks and outlinks for the articles I have in
>>>>> a group (x). Then I check (using Python) whether they have inlinks
>>>>> and outlinks from another group of articles. Right now I am doing a
>>>>> query for each article. I wanted to obtain all links for group (x)
>>>>> and then do this check... But getting all links for groups as big as
>>>>> 300,000 articles would mean 6 million links. Is it possible to obtain
>>>>> all of this, or is there a MySQL/RAM limit?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Marc
>>>>>
>>>>> 2015-03-13 19:29 GMT+01:00 <[email protected]>:
>>>>>
>>>>>> Today's Topics:
>>>>>>
>>>>>>    1. dimension well my queries for very large tables like
>>>>>>       pagelinks - Tool Labs (Marc Miquel)
>>>>>>    2. Re: dimension well my queries for very large tables like
>>>>>>       pagelinks - Tool Labs (John)
>>>>>>    3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)
>>>>>>    4. Re: Questions regarding the Labs Terms of use (Ryan Lane)
>>>>>>    5. Re: Questions regarding the Labs Terms of use (Pine W)
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> Message: 1
>>>>>> Date: Fri, 13 Mar 2015 17:59:09 +0100
>>>>>> From: Marc Miquel <[email protected]>
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Subject: [Labs-l] dimension well my queries for very large tables like
>>>>>>         pagelinks - Tool Labs
>>>>>>
>>>>>> Hello guys,
>>>>>>
>>>>>> I have a question regarding Tool Labs. I am doing research on links,
>>>>>> and although I know very well what I am looking for, I struggle with
>>>>>> how to get it efficiently...
>>>>>>
>>>>>> I need your opinion because you know the system very well, and what
>>>>>> is feasible and what is not.
>>>>>>
>>>>>> Let me explain what I need to do:
>>>>>> I have a list of articles for different languages, and I need to
>>>>>> check their pagelinks to see where they link to and what links to
>>>>>> them.
>>>>>>
>>>>>> Right now I run a query for each article id in this list of articles,
>>>>>> which ranges from 80,000 in some Wikipedias to 300,000 in others, and
>>>>>> more. I have to do it several times and it is very time consuming
>>>>>> (several days). I wish I could just count the total number of links
>>>>>> for each case, but I need to see only some of the links per article.
>>>>>>
>>>>>> I was thinking about getting all pagelinks and iterating in Python
>>>>>> (the language I use for all of this). This would be much faster
>>>>>> because I'd save all the queries, one per article, that I am doing
>>>>>> now. But the pagelinks table has millions of rows and I cannot load
>>>>>> all of that, because MySQL would die. I could buffer, but I haven't
>>>>>> tested whether that works either.
>>>>>>
>>>>>> I am considering creating a personal table in the database with
>>>>>> titles and ids, and inner joining it to obtain only the pagelinks for
>>>>>> these 300,000 articles. With this I would retrieve just 20% of the
>>>>>> database instead of 100%. That would sometimes be maybe 8M rows
>>>>>> (page_title or page_id, one of the two per row), or even more...
>>>>>> loaded into Python dictionaries and lists. Would that be a problem?
>>>>>> I have no idea how much RAM this implies and how much I can use in
>>>>>> Tool Labs.
>>>>>>
>>>>>> I am totally lost when I hit these problems related to scale... I
>>>>>> thought about writing to the IRC channel, but I thought it was maybe
>>>>>> too long and too specific. If you can give me any hint, that would
>>>>>> really help.
>>>>>>
>>>>>> Thank you very much!
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Marc Miquel
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> Message: 2
>>>>>> Date: Fri, 13 Mar 2015 13:07:20 -0400
>>>>>> From: John <[email protected]>
>>>>>> To: Wikimedia Labs <[email protected]>
>>>>>> Subject: Re: [Labs-l] dimension well my queries for very large tables
>>>>>>         like pagelinks - Tool Labs
>>>>>>
>>>>>> What kind of queries are you doing? Odds are they can be optimized.
>>>>>>
>>>>>> On Fri, Mar 13, 2015 at 12:59 PM, Marc Miquel <[email protected]> wrote:
>>>>>>
>>>>>> > [...]
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> Message: 3
>>>>>> Date: Fri, 13 Mar 2015 17:36:00 +0000
>>>>>> From: Tim Landscheidt <[email protected]>
>>>>>> To: [email protected]
>>>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>>>
>>>>>> (anonymous) wrote:
>>>>>>
>>>>>> > [...]
>>>>>>
>>>>>> > To be clear: I'm not going to make my code proprietary in
>>>>>> > any way. I just wanted to know whether I'm entitled to ask
>>>>>> > for the source of every Labs bot ;-)
>>>>>>
>>>>>> Everyone is entitled to /ask/, but I don't think you have a
>>>>>> right to /receive/ the source :-).
>>>>>>
>>>>>> AFAIK, there are two main reasons for the clause:
>>>>>>
>>>>>> a) WMF doesn't want to have to deal with individual licences
>>>>>>    that may or may not have the potential for litigation
>>>>>>    ("The Software shall be used for Good, not Evil"). By
>>>>>>    requiring OSI-approved, tried and true licences, the risk
>>>>>>    is negligible.
>>>>>>
>>>>>> b) Bots and tools running on an infrastructure financed by
>>>>>>    donors, like contributions to Wikipedia & Co., shouldn't
>>>>>>    be usable for blackmail. No one should be in a legal
>>>>>>    position to demand something "or else ...". The perpetuity
>>>>>>    of OS licences guarantees that everyone can be truly
>>>>>>    thankful to developers without having to fear that
>>>>>>    otherwise they shut down devices, delete content, etc.
>>>>>>
>>>>>> But the nice thing about collaboratively developed open
>>>>>> source software is that it is usually of better quality,
>>>>>> so clandestine code is often not that interesting.
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> Message: 4
>>>>>> Date: Fri, 13 Mar 2015 11:52:18 -0600
>>>>>> From: Ryan Lane <[email protected]>
>>>>>> To: Wikimedia Labs <[email protected]>
>>>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>>>
>>>>>> On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> > From https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use
>>>>>> > (verbatim): "Do not use or install any software unless the software
>>>>>> > is licensed under an Open Source license".
>>>>>> > What about tools and services made up of software themselves? Do
>>>>>> > they have to be Open Source?
>>>>>> > Strictly speaking, do the Terms of use require that all code be
>>>>>> > made available to the public?
>>>>>> > Thanks in advance.
>>>>>>
>>>>>> As the person who wrote the initial terms and included this, I can
>>>>>> speak to the spirit of the term (I'm not a lawyer, so I won't try to
>>>>>> go into any legal issues).
>>>>>>
>>>>>> I created Labs with the intent that it could be used as a mechanism
>>>>>> to fork the projects as a whole, if necessary. A means to this end
>>>>>> was including non-WMF employees in the process of infrastructure
>>>>>> operations (which is outside the goals of the tools project in Labs).
>>>>>> Tools/services that can't be distributed publicly harm that goal.
>>>>>> Tools/services that aren't open source completely break that goal.
>>>>>> It's fine if you wish not to maintain the code in a public git repo,
>>>>>> but if another tool maintainer wishes to publish your code, there
>>>>>> should be nothing blocking that.
>>>>>>
>>>>>> Depending on external closed-source services is a debatable topic. I
>>>>>> know in the past we've decided to allow it. It goes against the
>>>>>> spirit of the project, but it doesn't require us to distribute
>>>>>> closed-source software in the case of a fork.
>>>>>>
>>>>>> My personal opinion is that your code should be in a public
>>>>>> repository to encourage collaboration. As the terms are written,
>>>>>> though, your code is required to be open source, and any libraries
>>>>>> it depends on must be as well.
>>>>>>
>>>>>> - Ryan
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> Message: 5
>>>>>> Date: Fri, 13 Mar 2015 11:29:47 -0700
>>>>>> From: Pine W <[email protected]>
>>>>>> To: Wikimedia Labs <[email protected]>
>>>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>>>
>>>>>> Question: are there heightened security or privacy risks posed by
>>>>>> having non-open-source code running in Labs?
>>>>>>
>>>>>> Is anyone proactively auditing Labs software for open source
>>>>>> compliance, and if not, should this be done?
>>>>>>
>>>>>> Pine
>>>>>>
>>>>>> On Mar 13, 2015 10:52 AM, "Ryan Lane" <[email protected]> wrote:
>>>>>>
>>>>>> > [...]
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> End of Labs-l Digest, Vol 39, Issue 13
>>>>>> **************************************
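
PS: the personal-table idea you mention further up is also workable, and it pushes the filtering into MySQL instead of Python. Another untested sketch, following the same assumptions as the one above; "u1234__marc" and "my_pages" are made-up names, and it assumes your user database sits on the same labsdb server as the replica so the cross-database join is allowed:

import os
import MySQLdb

conn = MySQLdb.connect(
    host="cawiki.labsdb",        # placeholder replica host
    read_default_file=os.path.expanduser("~/replica.my.cnf"))
cur = conn.cursor()

# A one-column staging table holding the selected articles.
cur.execute("CREATE TABLE IF NOT EXISTS u1234__marc.my_pages "
            "(page_id INT UNSIGNED NOT NULL PRIMARY KEY)")

page_ids = [(int(line),) for line in open("page_ids") if line.strip()]
for i in range(0, len(page_ids), 10000):     # load the 300k ids in chunks
    cur.executemany("INSERT IGNORE INTO u1234__marc.my_pages (page_id) "
                    "VALUES (%s)", page_ids[i:i + 10000])
conn.commit()

# One join instead of 300,000 queries: links whose source is in the set and
# whose target resolves to a page that is also in the set.
cur.execute("""
    SELECT pl.pl_from, p.page_id
    FROM cawiki_p.pagelinks AS pl
    JOIN cawiki_p.page AS p
         ON p.page_namespace = pl.pl_namespace
        AND p.page_title     = pl.pl_title
    JOIN u1234__marc.my_pages AS src ON src.page_id = pl.pl_from
    JOIN u1234__marc.my_pages AS dst ON dst.page_id = p.page_id
""")
links = cur.fetchall()   # (source_id, target_id) pairs, only for your articles

For the "links from another group" case, load the second group into a second staging table and join it on pl_from instead of my_pages, so the sources come from that group and the targets from yours.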
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
