> Now, in order to get or save the files in their actual format, in your
> case, .flv or .epub files, you will have to write an additional program
> (for example in Java).
No, you don't have to: the parse-tika plugin can parse both .epub and .flv
files.
- For the supported formats, see http://tika.apache.org/1.2/formats.html
- Test it, e.g.:
  % bin/nutch parsechecker http://.../book.epub
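
For parse-tika to be picked up, it has to be listed in the plugin.includes
property. A minimal sketch of such an override in conf/nutch-site.xml,
assuming a Nutch 1.x setup: the plugin list shown here is only illustrative
and may not match your version, so start from the value in your own
conf/nutch-default.xml and just make sure the parse group contains tika.

```xml
<!-- conf/nutch-site.xml: illustrative override, not a complete configuration -->
<property>
  <name>plugin.includes</name>
  <!-- The exact plugin list is an assumption; copy yours from
       nutch-default.xml and check that the parse group includes tika. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

After changing it, rerunning the parsechecker command above on one of your
URLs is a quick way to see whether the file is now parsed.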

By the way, please use [email protected]!

On 05/13/2013 11:17 PM, Pankaj Kumar wrote:
> I think you are doing fine so far.
> Nutch usually crawls the data and fetches the URLs of all the files (HTML,
> PDF, etc.) in the specified directory, in binary format.
> Now, in order to get or save the files in their actual format, in your
> case, .flv or .epub files, you will have to write an additional program
> (for example in Java).
> 
> Hope this helps.
> 
> With Regards,
> Pankaj Kumar
> 
> 
> 
> On Mon, May 13, 2013 at 6:35 AM, vicky4751 <[email protected]> wrote:
> 
>> Hi,
>>
>> I am working with Apache Nutch and Solr. My requirement is to parse the
>> contents of .flv and .epub files. I am using the command below to parse
>> the files:
>>
>> bin/nutch crawl urls -solr http://localhost:8983/solr/
>>
>> I have put the file URLs in the urls folder of Nutch. The above command
>> works, but when I tried to view the parsed content using Solr with the
>> following command, it just displays the URLs of the files instead of
>> their contents.
>>
>> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb
>> crawl/linkdb crawl/segments/*
>>
>> Please advise.
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Unable-to-parse-flv-and-epub-file-contents-using-nutch-tp4062927.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>>
> 
