Hello Lewis,

1. Counters: for me they are a requirement, as they are key to regular 
inspections of ongoing crawls, finding errors, and debugging. I hope you can 
find a workaround.

2. Sounds interesting, but I'd like to see the test run with 12M rather than 
12k URLs.
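
For reference, the quoted Injector timings can be reduced to a warm-run speedup with a few lines of Python. This is only a rough sketch: the numbers are copied from the quoted message, the variable names are made up for illustration, and dropping the first run of each series as warm-up is an assumption. It says nothing about how either engine behaves at 12M URLs.

```python
# Elapsed times in seconds, copied from the quoted Injector runs (~12k URLs).
tez_runs = [42, 13, 14]    # mapreduce.framework.name = yarn-tez
yarn_runs = [34, 32, 34]   # mapreduce.framework.name = yarn


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Assumption: treat the first run of each series as warm-up and skip it.
tez_warm = mean(tez_runs[1:])
yarn_warm = mean(yarn_runs[1:])

# Ratio of warm-run means; a value above 1 means Tez was faster here.
speedup = yarn_warm / tez_warm

print(f"warm mean on Tez:  {tez_warm:.1f} s")
print(f"warm mean on YARN: {yarn_warm:.1f} s")
print(f"speedup:           {speedup:.2f}x")
```

With only three runs per setting on a small input, this ratio is an early indication and not a benchmark.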

A question: are the files produced with Tez compatible with MapReduce programs 
(map and sequence files)? It would be a tremendous advantage if existing programs 
could work with them. It would be a real pain to have to rewrite all the code in one 
go. We have seen that lead to a dead end many times, including with our 2.x branch.

Have a nice evening!
Markus

 
 
-----Original message-----
> From:Lewis John McGibbney <[email protected]>
> Sent: Monday 21st December 2020 21:40
> To: [email protected]
> Subject: Re: [DISCUSS] Replacing MapReduce with Tez
> 
> Hi dev@,
> Short update here. I've documented my initial observations running Nutch on 
> Tez at https://s.apache.org/viee3
> Specific early findings are as follows:
> 1. Counters don't appear to work... which makes sense as all existing 
> counters are manifested using the MapReduce framework. I'm not sure if Tez 
> has a similar/equivalent concept of counters but I am working to find out 
> more.
> 2. So far, running some basic experiments using the Injector job on around 
> 12k URLs, I've observed the following:
> - When 'mapreduce.framework.name' is set to 'yarn-tez' I am observing the 
> following runtimes
>   * 1st run: elapsed: 00:00:42
>   * 2nd run: elapsed: 00:00:13
>   * 3rd run: elapsed: 00:00:14
> 
> - When 'mapreduce.framework.name' is set to 'yarn' I am observing the 
> following runtimes
>   * 1st run: elapsed: 00:00:34
>   * 2nd run: elapsed: 00:00:32
>   * 3rd run: elapsed: 00:00:34
> 
> So after the first run, it looks like running the Injector job on Tez results 
> in a dramatic runtime improvement.
> 
> As I mentioned in the Tez thread, I'm going to document all of this on the 
> Nutch wiki. I also plan to continue my evaluation over the holidays and will 
> report back here when I have more information. 
> 
> Thanks
> 
> On 2020/12/10 07:46:30, lewis john mcgibbney <[email protected]> wrote: 
> > Hi dev@,
> > A while ago I had thought about bringing this topic up... I then got
> > busy... for ages. I'll therefore get straight to the point.
> > Has anyone on the dev@ team had any experience using Apache Tez -
> > tez.apache.org?
> > Tez promises multiple improvements over MapReduce. Naturally I wondered
> > whether the Nutch project is at a stage of maturity now that we would look
> > to leverage something more performant than legacy MapReduce.
> > Were we to consider evolving Nutch by re-architecting it to use Tez as the
> > processing engine, this would be a significant work effort.
> > I just wanted to throw this out there for some blue-sky feedback.
> > Thanks
> > lewismc
> > 
> > -- 
> > http://home.apache.org/~lewismc/
> > http://people.apache.org/keys/committer/lewismc
> > 
> 
