If you wrote a simple URL fetcher function for Cascading, you would
have a very powerful web crawler that would dwarf Nutch in flexibility.
That said, Nutch is optimized for storage, has supporting tools and
ranking algorithms, and has been up against some nasty HTML and other
document types. Building a really robust crawler is non-trivial.
If I were just starting out and needed to implement a proprietary
process, I would use Nutch for fetching raw content and refreshing
it, then use Cascading for parsing, indexing, etc.
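For illustration, such a fetcher could look roughly like the sketch below,
written against the Cascading operation API (BaseOperation/Function). The
field names and the bare java.net.URL fetch are placeholders of mine, the
exact signatures may vary by Cascading version, and a real fetcher would
need timeouts, redirects, robots.txt handling, and politeness:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;

  import cascading.flow.FlowProcess;
  import cascading.operation.BaseOperation;
  import cascading.operation.Function;
  import cascading.operation.FunctionCall;
  import cascading.tuple.Fields;
  import cascading.tuple.Tuple;

  // Sketch: consumes a single "url" argument, emits ("url", "content") tuples.
  public class UrlFetcher extends BaseOperation implements Function {

    public UrlFetcher() {
      super(1, new Fields("url", "content")); // one argument in, two fields out
    }

    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
      String url = functionCall.getArguments().getTuple().getString(0);
      try {
        // naive fetch, for illustration only
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(new URL(url).openStream()));
        StringBuilder content = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null)
          content.append(line).append('\n');
        reader.close();

        functionCall.getOutputCollector().add(new Tuple(url, content.toString()));
      } catch (Exception exception) {
        // silently drop unfetchable URLs; could also emit an error tuple
      }
    }
  }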
cheers,
chris
On Sep 8, 2008, at 12:42 AM, tarjei wrote:
Hi Alex (and others).
You should take a look at Nutch. It's a search engine built on Lucene,
though it can be set up on top of Hadoop. Take a look:
This didn't help me much. Although the description I gave of the basic
flow of the app seems to be close to what Nutch is doing (and I've been
looking at the Nutch code), the questions are more general and not
related to indexing as such, but to code organization. If someone has
more input on those, feel free to add it.
On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse <[EMAIL PROTECTED]> wrote:
Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
tasks. The basic flow is:

input: array of URLs
actions:
  1. get pages
  2. extract new URLs from pages -> start new job
     extract text -> index / filter (as new jobs)
What I'm considering is how I should build this application to fit into
the map/reduce context. I'm thinking that steps 1 and 2 should be
separate map/reduce tasks that then pipe things on to the next step.
This is where I'm a bit at a loss as to how best to organize the code
into logical units, and also how to spawn new tasks when an old one is
over.
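To make step 2 concrete, a link-extraction map task against the old
org.apache.hadoop.mapred API might look roughly like the sketch below. The
class name, the (url, content) record layout assumed from step 1, and the
naive href regex are all illustrative assumptions, not working crawler code:

  import java.io.IOException;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  // Sketch of step 2: input is (url, page content) produced by step 1,
  // output is each extracted link, ready to seed the next fetch round.
  public class LinkExtractMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {

    private static final Pattern HREF =
        Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    public void map(Text url, Text content,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // naive regex extraction; a real parser (e.g. what Nutch uses) copes with broken HTML
      Matcher m = HREF.matcher(content.toString());
      while (m.find())
        output.collect(new Text(m.group(1)), url); // link -> page it was found on
    }
  }

A reduce step following this could then de-duplicate the extracted URLs and
drop ones already fetched before they seed the next round.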
Is the usual way to control the flow of a set of tasks to have an
external application running that listens for jobs ending via the
endNotificationUri and then spawns new tasks, or should the job itself
contain code to create new jobs? Would it be a good idea to use
Cascading here?
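For a simple linear pipeline, one option (sketched below, with made-up class
names and paths) is a plain driver that submits the jobs back to back;
JobClient.runJob blocks until each job finishes, so no external notification
listener is needed. Cascading essentially automates this kind of chaining:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SequenceFileInputFormat;
  import org.apache.hadoop.mapred.SequenceFileOutputFormat;

  // Sketch of a driver chaining a fetch job and a link-extraction job.
  public class CrawlDriver {
    public static void main(String[] args) throws Exception {
      JobConf fetch = new JobConf(CrawlDriver.class);
      fetch.setJobName("fetch-pages");
      fetch.setMapperClass(FetchMapper.class);            // hypothetical step-1 mapper
      fetch.setOutputKeyClass(Text.class);
      fetch.setOutputValueClass(Text.class);
      fetch.setOutputFormat(SequenceFileOutputFormat.class); // page content may contain newlines
      FileInputFormat.setInputPaths(fetch, new Path("crawl/urls"));
      FileOutputFormat.setOutputPath(fetch, new Path("crawl/pages"));
      JobClient.runJob(fetch); // blocks until the job completes

      JobConf extract = new JobConf(CrawlDriver.class);
      extract.setJobName("extract-links");
      extract.setMapperClass(LinkExtractMapper.class);    // step-2 mapper sketched above
      extract.setOutputKeyClass(Text.class);
      extract.setOutputValueClass(Text.class);
      extract.setInputFormat(SequenceFileInputFormat.class);
      FileInputFormat.setInputPaths(extract, new Path("crawl/pages"));
      FileOutputFormat.setOutputPath(extract, new Path("crawl/new-urls"));
      JobClient.runJob(extract);
    }
  }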
I'm also considering how I should do job scheduling (I have a lot of
recurring tasks). Has anyone found a good framework for job control of
recurring tasks, or should I plan to build my own using Quartz?
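For reference, wiring up a recurring trigger in Quartz is only a few lines.
The sketch below uses the Quartz 1.x-style API from memory (exact
constructors may differ by version), and CrawlJob is a made-up class that
would implement org.quartz.Job and kick off the Hadoop driver:

  import org.quartz.CronTrigger;
  import org.quartz.JobDetail;
  import org.quartz.Scheduler;
  import org.quartz.impl.StdSchedulerFactory;

  // Sketch: run the (hypothetical) crawl driver every night at 02:00.
  public class CrawlScheduler {
    public static void main(String[] args) throws Exception {
      Scheduler scheduler = new StdSchedulerFactory().getScheduler();
      scheduler.start();

      JobDetail job = new JobDetail("crawl", Scheduler.DEFAULT_GROUP, CrawlJob.class);
      CronTrigger trigger =
          new CronTrigger("crawl-trigger", Scheduler.DEFAULT_GROUP, "0 0 2 * * ?");
      scheduler.scheduleJob(job, trigger);
    }
  }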
Any tips/best practices with regard to the issues described above are
most welcome. Feel free to ask further questions if you find my
descriptions of the issues lacking.
Kind regards,
Tarjei
--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/