Hello Lewis,

1. Counters: for me they are a requirement, as they are key to regular inspection of ongoing crawls, finding errors, and debugging. I hope you can find a workaround.
2. Sounds interesting, but I'd like to see the test run with 12M rather than 12k URLs. A question: are the files produced with Tez (map and sequence files) compatible with MapReduce programs? It would be a tremendous advantage if existing programs could work with them. It would be a real pain to have to rewrite all the code in one go. We have seen that lead to a dead end many times, including with our 2.x branch.

Have a nice evening!
Markus

-----Original message-----
> From:Lewis John McGibbney <[email protected]>
> Sent: Monday 21st December 2020 21:40
> To: [email protected]
> Subject: Re: [DISCUSS] Replacing MapReduce with Tez
>
> Hi dev@,
> Short update here. I've documented my initial observations running Nutch on
> Tez at https://s.apache.org/viee3
> Specific early findings are as follows
> 1. Counters don't appear to work... which makes sense as all existing
> counters are manifested using the MapReduce framework. I'm not sure if Tez
> has a similar/equivalent concept of counters but I am working to find out
> more.
> 2. So far running some basic experiments using the Injector job on around
> ~12k URLs, I've observed the following
> - When 'mapreduce.framework.name' is set to 'yarn-tez' I am observing the
> following runtimes
> * 1st run: elapsed: 00:00:42
> * 2nd run: elapsed: 00:00:13
> * 3rd run: elapsed: 00:00:14
>
> - When 'mapreduce.framework.name' is set to 'yarn' I am observing the
> following runtimes
> * 1st run: elapsed: 00:00:34
> * 2nd run: elapsed: 00:00:32
> * 3rd run: elapsed: 00:00:34
>
> So after the first run, it looks like running the Injector job on Tez results
> in a dramatic runtime improvement.
>
> As I mentioned in the Tez thread, I'm going to document all of this on the
> Nutch wiki. I also plan to continue my evaluation over the holidays and will
> report back here when I have more information.
>
> Thanks
>
> On 2020/12/10 07:46:30, lewis john mcgibbney <[email protected]> wrote:
> > Hi dev@,
> > A while ago I had thought about bringing this topic up... I then got
> > busy... for ages. I'll therefore get straight to the point.
> > Has anyone on the dev@ team had an experience using Apache Tez -
> > tez.apache.org?
> > Tez promises multiple improvements over MapReduce. Naturally I wondered
> > whether the Nutch project is at a stage of maturity now that we would look
> > to leverage something more performant than legacy MapReduce.
> > Were we to consider evolving Nutch by re-architecting it to use Tez as the
> > processing engine, this would be a significant work effort.
> > I just wanted to throw this out there for some blue-sky feedback.
> > Thanks
> > lewismc
> >
> > --
> > http://home.apache.org/~lewismc/
> > http://people.apache.org/keys/committer/lewismc
> >
>
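
For reference, the Injector timings quoted in the thread can be compared with a few lines of arithmetic. This is only a rough sketch in Python and is not part of the original messages: the helper names (`to_seconds`, `warm_mean`, `summarize`) are hypothetical, and the only data used are the six elapsed times reported for 'yarn-tez' and 'yarn'. It assumes the first run of each series is a warm-up and compares the means of the remaining runs.

```python
from statistics import mean

# Elapsed times copied from the quoted message, keyed by the value of
# 'mapreduce.framework.name' used for each series of Injector runs.
RUNS = {
    "yarn-tez": ["00:00:42", "00:00:13", "00:00:14"],
    "yarn": ["00:00:34", "00:00:32", "00:00:34"],
}


def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' elapsed-time string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def warm_mean(times: list[str]) -> float:
    """Mean runtime in seconds, skipping the first run (assumed warm-up)."""
    return mean(to_seconds(t) for t in times[1:])


def summarize(runs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Report the first-run time and the warm mean for each framework."""
    return {
        name: {
            "first_run_s": to_seconds(times[0]),
            "warm_mean_s": warm_mean(times),
        }
        for name, times in runs.items()
    }


summary = summarize(RUNS)
# Ratio of warm means: how many times faster the Tez runs were after run 1.
speedup = summary["yarn"]["warm_mean_s"] / summary["yarn-tez"]["warm_mean_s"]

for name, stats in summary.items():
    print(f"{name}: first run {stats['first_run_s']}s, "
          f"warm mean {stats['warm_mean_s']:.1f}s")
print(f"warm-run speedup (yarn / yarn-tez): {speedup:.2f}x")
```

With only two warm runs per setting and roughly 12k URLs, this ratio is indicative at best; a run at the 12M scale requested above would be needed before drawing conclusions.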

