Hello team, I am new to working with Nutch.
My task is to extract the image MIME subtypes and the image URLs found by crawling a list of URLs. My approach is as follows:

a) Image URLs: After crawling with Nutch, use the nutchpy sequence reader to read the segment data (/segments/content/data), then write a Python script to extract the image URLs embedded in the various tags. Is this approach correct, or is there a better way of doing it? I could not find any other dataset created by Nutch that contains this information.

b) MIME types: In the same segment data, search for the type and collect all subtypes under the image type. Is this the correct way of doing it, or is there a better approach?

It would be really helpful if I could get some pointers to resources for solving this task.

Thanks,
Charan
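
P.S. To make the question concrete, here is a minimal sketch of the extraction step described in (a) and (b), in plain Python. It assumes the page HTML and the Content-Type string have already been read out of the segment data (for example with nutchpy); that reading step is not shown. The names ImageURLParser, extract_image_urls and image_subtype are placeholders of my own, not part of Nutch or nutchpy, and the sketch only looks at img tags with a src attribute.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageURLParser(HTMLParser):
    """Collects the src of every <img> tag (hypothetical helper, not a Nutch API)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Only <img src="..."> is handled here; other tags would need extra cases.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    # Resolve relative paths against the URL of the crawled page.
                    self.urls.append(urljoin(self.base_url, value))


def extract_image_urls(html, base_url):
    """Return absolute image URLs found in one page's HTML."""
    parser = ImageURLParser(base_url)
    parser.feed(html)
    return parser.urls


def image_subtype(content_type):
    """Return the subtype of an image/* Content-Type value, else None."""
    # Drop parameters such as "; charset=..." before splitting type/subtype.
    main, _, sub = content_type.split(";")[0].strip().lower().partition("/")
    return sub if main == "image" and sub else None
```

The idea would be to call extract_image_urls on each HTML record and image_subtype on each record's content type, then collect the distinct results.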

