> Now, in order to get or save the files in their actual format, in your
> case, .flv or .epub files, you will have to write additional program (for
> example in Java).

No, you don't have to: the parse-tika plugin can parse .epub and .flv files
- see http://tika.apache.org/1.2/formats.html
- test it, e.g.:
  % bin/nutch parsechecker http://.../book.epub

btw, please, use [email protected]!

On 05/13/2013 11:17 PM, Pankaj Kumar wrote:
> I think, you are doing good till now.
> Nutch usually crawls the data and fetches the URLs of all the files, like
> html, pdf etc in the specified directory in binary format.
> Now, in order to get or save the files in their actual format, in your
> case, .flv or .epub files, you will have to write additional program (for
> example in Java).
>
> Hope this helps.
>
> With Regards,
> Pankaj Kumar
>
>
> On Mon, May 13, 2013 at 6:35 AM, vicky4751 <[email protected]> wrote:
>
>> Hi,
>>
>> i am working with apache nutch and solr, my requirement is to parse the
>> contents of flv and epub files, i am using below command to parse the files
>>
>> bin/nutch crawl urls -solr http://localhost:8983/solr/
>>
>> i have kept the file urls in urls folder of nutch. the above command is
>> working but when i tried to view the parsed content using solr with the
>> following command its is just displaying the url of the files instead of
>> its contents.
>>
>> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb
>> crawl/linkdb crawl/segments/*
>>
>> please suggest me....
>>
>> Thanks
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Unable-to-parse-flv-and-epub-file-contents-using-nutch-tp4062927.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>>
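
A minimal sketch of the configuration side of the parse-tika suggestion at the top of this message, assuming a Nutch 1.x install where plugins are enabled through the plugin.includes property in conf/nutch-site.xml. The plugin list shown here is an assumption modelled on the stock default and should be adapted to the local setup; the part that matters is that the parse entry includes tika.

```xml
<!-- conf/nutch-site.xml: this property goes inside the <configuration> element. -->
<!-- The value is an illustrative assumption, not taken from the poster's setup. -->
<property>
  <name>plugin.includes</name>
  <!-- parse-(html|tika) enables the Tika-based parser alongside the HTML parser -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

With parse-tika enabled, running the parsechecker command from the reply against one of the file URLs should show whether any text is extracted before the segment is indexed into Solr.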

