Including [email protected] as not all of you are on the Nutch lists ;-) Julien
---------- Forwarded message ----------
From: Julien Nioche <[email protected]>
Date: 16 September 2013 17:43
Subject: Re: 2.x vs. 1.x speed
To: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
Cc: Otis Gospodnetic <[email protected]>

Guys,

Following the discussion we had some time ago about comparing 1.x with 2.x,
we did some tests and put the results on
http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html

Feel free to comment.

Best,

Julien

On 24 August 2013 05:51, Lewis John Mcgibbney <[email protected]> wrote:

> I am sure that Renato (if he is watching) can maybe plug in as well.
> We find in Gora that with native Hadoop stores (in every sense of the word)
> such as Avro, HBase and Accumulo, when we execute a query with
> GoraInputFormat via getPartitions we retrieve GoraInputSplits natively,
> which means splits are obtained for MapReduce jobs, such as many of the
> jobs we run in Nutch. On the other hand, stores such as Cassandra and web
> service stores such as DynamoDB (currently) do not support Hadoop out of
> the box (the former we are working on and hope to have implemented in Gora
> soon), so it is not as simple to get partitions in the same way we would
> in a Hadoop-native store. We therefore obtain one partition to be used as
> an InputSplit for the MR job. This is certainly an area for concern and
> right now a bottleneck for some operations. We continue to work on this.
>
> On Wednesday, August 7, 2013, Julien Nioche <[email protected]> wrote:
> > Hi Otis
> >
> > Definitely *not* the fetching speed. Actually everything, but *not* the
> > fetching speed. The fetcher is pretty much the same as in 1.x, and anyway
> > the fetching performance is pretty much always limited by the politeness
> > settings, not the implementation.
> >
> > Re backends: some backend implementations are more mature than others.
> > The one for HBase is probably the one most widely used, the Cassandra one
> > has been greatly improved, in particular performance-wise, the SQL one is
> > broken, etc. We need to measure this, as this is just a gut feeling at
> > this stage.
> >
> > Now for what is slower and why: again, this has to be measured, but I
> > expect 2.x to be slower partly because of [1], i.e. the filtering of
> > entries is not done by the backends (some might provide a way of doing
> > it) but on the client side, when we create the input for mapred. In other
> > words, we pull things from the backend just to discard them. Since 2.x
> > does not have segments like 1.x (which the fetch + parse mapreduce jobs
> > take as single input), we scan the whole table even if we want to fetch
> > or parse a handful of entries.
> >
> > On the other hand, 2.x specifies what columns to retrieve for a given
> > job, whereas 1.x will for instance deserialize the crawldatum entirely.
> > The metadata objects are costly to read/write, so 2.x might have the
> > upper hand from that point of view, since it pulls and deserializes only
> > what it needs.
> >
> > Finally, the most costly steps in a large crawl in 1.x are the generation
> > and update, as we have to read/write the crawldb entirely. The way the
> > updates are done in 2.x is different and should be a lot faster.
> >
> > Could anyone please correct me if I am wrong. Some of this is based on my
> > understanding of 2.x, which dates back quite a while, and some of the
> > stuff might have changed in the meantime. The performance would probably
> > vary a lot based on the fine tuning of each backend implementation, but
> > having some basic comparison would confirm some of the assertions above.
> >
> > Julien
> >
> > [1] https://issues.apache.org/jira/browse/GORA-119
> >
> >> Julien, could you please elaborate a bit about your comment about speed
> >> depending on the backend used?
> >>
> >> Yes, you were the person I was referring to :)
> >>
> >> Oh, and I *believe* you said it was the fetching speed that was
> >> different between 1.x and 2.x. Is that right? Or is some other phase
> >> slower in 2.x?
> >>
> >> Thanks,
> >> Otis
> >> ----
> >> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> >> http://sematext.com/spm
> >>
> >> > ________________________________
> >> > From: Julien Nioche <[email protected]>
> >> > To: "[email protected]" <[email protected]>
> >> > Sent: Tuesday, August 6, 2013 10:54 AM
> >> > Subject: Re: 2.x vs. 1.x speed
> >> >
> >> > Hi Otis,
> >> >
> >> > That certainly depends on the backend used, but on the whole it
> >> > wouldn't be surprising. It would be good to have some data to
> >> > substantiate it. I am planning to put my intern on the case and have
> >> > some basic comparison as soon as she gets a good grip of Hadoop /
> >> > Nutch etc., but if someone else wants to do it, please go ahead.
> >> >
> >> > In case I happen to be the person who told you that, Otis, well, at
> >> > least I am consistent ;-)
> >> >
> >> > Julien
> >> >
> >> > On 6 August 2013 09:08, Otis Gospodnetic <[email protected]> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> At some point earlier this year I spoke to a person who told me 2.x
> >> >> is (a little?) slower than 1.x. Is that still the case?
> >> >>
> >> >> Thanks,
> >> >> Otis
> >> >> --
> >> >> Solr & ElasticSearch Support -- http://sematext.com/
> >> >> Performance Monitoring -- http://sematext.com/spm
> >> >
> >> > --
> >> > Open Source Solutions for Text Engineering
> >> >
> >> > http://digitalpebble.blogspot.com/
> >> > http://www.digitalpebble.com
> >> > http://twitter.com/digitalpebble
>
> --
> *Lewis*
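
For readers who want something concrete, the point in the thread about client-side filtering (GORA-119) versus segment input can be illustrated with a toy sketch. It uses neither Nutch nor Gora: the names `Row`, `scan_and_filter_client_side`, `read_segment` and `build_example` are hypothetical, and an in-memory list and dict merely stand in for a 2.x backend table and 1.x segments. It only counts how many records each approach has to read to select the same handful of entries, and says nothing about real timings.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Row:
    """Hypothetical stand-in for one entry in a 2.x-style web table."""
    url: str
    batch_id: Optional[str]  # assumed marker for "selected for this batch"


def scan_and_filter_client_side(table: List[Row], batch_id: str) -> Tuple[List[str], int]:
    """2.x-style as described in the thread: pull every row from the
    backend and discard the non-matching ones on the client side."""
    rows_read = 0
    selected = []
    for row in table:
        rows_read += 1  # every row is fetched, whether it is kept or not
        if row.batch_id == batch_id:
            selected.append(row.url)
    return selected, rows_read


def read_segment(segments: Dict[str, List[str]], batch_id: str) -> Tuple[List[str], int]:
    """1.x-style as described in the thread: the job takes a single
    segment as input, and the segment holds only the batch's entries."""
    segment = segments.get(batch_id, [])
    return list(segment), len(segment)


def build_example(total: int = 10_000, batch_size: int = 5):
    """Build a toy table in which only `batch_size` rows belong to the batch."""
    table = [
        Row(url=f"http://example.com/{i}",
            batch_id="batch-1" if i < batch_size else None)
        for i in range(total)
    ]
    segments = {"batch-1": [r.url for r in table if r.batch_id == "batch-1"]}
    return table, segments


if __name__ == "__main__":
    table, segments = build_example()
    urls_2x, read_2x = scan_and_filter_client_side(table, "batch-1")
    urls_1x, read_1x = read_segment(segments, "batch-1")
    print(f"client-side filter: {len(urls_2x)} selected, {read_2x} rows read")
    print(f"segment input:      {len(urls_1x)} selected, {read_1x} rows read")
```

Both functions return the same URLs; the difference is only in the number of records touched, which is the effect the thread expects to dominate when a job needs a handful of entries from a large table.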

