What kind of queries are you doing? Odds are they can be optimized.

On Fri, Mar 13, 2015 at 12:59 PM, Marc Miquel <[email protected]> wrote:
> Hello guys,
>
> I have a question regarding Tool Labs. I am doing research on links, and
> although I know very well what I am looking for, I am struggling with how
> to get it efficiently.
>
> I would like your opinion because you know the system well, and what is
> feasible and what is not.
>
> Here is what I need to do: I have a list of articles in different
> languages, and for each article I need to check its pagelinks to see where
> it points to and what points to it.
>
> Right now I run one query per article id in this list, which ranges from
> 80,000 articles in some Wikipedias to 300,000 or more in others. I have to
> do this several times and it is very time consuming (several days). I wish
> I could just count the total number of links in each case, but I actually
> need to look at some of the links per article.
>
> I was thinking about fetching all pagelinks and iterating over them in
> Python (the language I use for all of this). That would be much faster
> because I would save all the per-article queries I am doing now. But the
> pagelinks table has millions of rows and I cannot load all of that,
> because MySQL would die. I could buffer, but I haven't tried whether that
> works either.
>
> I am considering creating a personal table in the database with titles and
> ids, and inner joining against it to obtain the pagelinks for just these
> 300,000 articles. That way I would retrieve only about 20% of the table
> instead of 100%. It could still be 8M rows or more (page_title or page_id,
> one of the two per row), loaded into Python dictionaries and lists. Would
> that be a problem? I have no idea how much RAM this implies or how much I
> can use on Tool Labs.
>
> I am totally lost when I run into these problems of scale. I thought about
> writing to the IRC channel, but it seemed too long and too specific. Any
> hint would really help.
>
> Thank you very much!
>
> Cheers,
>
> Marc Miquel
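For what it's worth, here is a minimal sketch of the batching idea, in case
it helps: instead of issuing one query per article, fetch the outgoing links
for a few hundred page ids at a time with an IN (...) clause. It assumes
pymysql and the usual replica credentials file; the host, database name,
batch size and the outgoing_links helper are placeholders to adapt to
whichever wiki you are querying.

    import os
    import pymysql

    BATCH = 500  # page ids per query; tune to taste

    # Assumed replica host/db and credentials file; adjust for your wiki.
    conn = pymysql.connect(
        host='enwiki.labsdb',
        db='enwiki_p',
        read_default_file=os.path.expanduser('~/replica.my.cnf'),
        charset='utf8',
    )

    def outgoing_links(page_ids):
        """Yield (pl_from, pl_namespace, pl_title) rows for the given pages,
        fetched in batches rather than one query per page."""
        with conn.cursor() as cur:
            for i in range(0, len(page_ids), BATCH):
                chunk = page_ids[i:i + BATCH]
                placeholders = ', '.join(['%s'] * len(chunk))
                cur.execute(
                    'SELECT pl_from, pl_namespace, pl_title '
                    'FROM pagelinks WHERE pl_from IN ({})'.format(placeholders),
                    chunk,
                )
                for row in cur:
                    yield row

    # Example usage: my_page_ids would be your list of 80,000-300,000 ids.
    # for pl_from, ns, title in outgoing_links(my_page_ids):
    #     ...

The other route you mention, loading your page ids into your own table and
joining against pagelinks on the server side, should also work and keeps most
of the heavy lifting in MySQL rather than in Python dictionaries. Either way,
working in batches keeps each result set at a size the replicas and your
tool's memory can handle.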
_______________________________________________ Labs-l mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/labs-l
