Nutch datasets : How to ??

Charan Shampur Wed, 23 Sep 2015 13:03:13 -0700

Hello team,

I am new to working with Nutch.


My task is to extract the different image MIME subtypes and the image
URLs found while crawling a list of URLs. My approach for this task is
as follows:

a) For image URLs:

After crawling with Nutch, use the nutchpy sequence reader to read the
segment content dataset (/segments/content/data) and write a Python script
to extract the image URLs embedded within various tags.

Is my approach correct, or is there a better way of doing it? I could
not find any other dataset created by Nutch that contains this
information.
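
A minimal sketch of what the extraction script could look like, assuming
each record has already been read from the segment and reduced to a page
URL plus its HTML as a string. The names ImageURLParser and
extract_image_urls are hypothetical, and only <img src> and
<meta property="og:image"> are handled here; other tags would need their
own cases.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageURLParser(HTMLParser):
    """Collect image URLs from <img src> and <meta property="og:image">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the crawled page (assumed known)
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "img":
            candidate = attrs.get("src")
        elif tag == "meta" and attrs.get("property") == "og:image":
            candidate = attrs.get("content")
        else:
            candidate = None
        if candidate:
            # Resolve relative references against the page URL.
            self.image_urls.append(urljoin(self.base_url, candidate))


def extract_image_urls(html, base_url):
    """Return the image URLs found in one HTML document, in page order."""
    parser = ImageURLParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.image_urls
```

The function would be called once per HTML record, for example
extract_image_urls(html_text, page_url), and the results merged into a set
to drop duplicates across pages.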

b) For MIME types:

In the same segment dataset, search for the content type of each record
and collect all subtypes under the image type.

Is this the correct way of doing it? Is there a better approach?
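
A minimal sketch of the subtype tally, assuming the content-type value of
each record has already been pulled out of the segment as a plain string
such as "image/png" or "text/html; charset=utf-8". The function name
image_subtypes is hypothetical.

```python
from collections import Counter


def image_subtypes(content_types):
    """Count image/* subtypes in an iterable of Content-Type strings."""
    counts = Counter()
    for raw in content_types:
        if not raw:
            continue  # skip records with no content type
        # Drop parameters such as "; charset=utf-8" and normalize case.
        mime = raw.split(";", 1)[0].strip().lower()
        main, _, sub = mime.partition("/")
        if main == "image" and sub:
            counts[sub] += 1
    return counts
```

Calling image_subtypes on the list of content types from all records gives
a mapping from subtype (png, jpeg, gif, ...) to the number of records of
that subtype.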

It would be really helpful if I could get some pointers to resources for
solving the above task.


Thanks,

Charan
