Hi, my requirement is to extract the contents of EPUB files using Apache Nutch and Solr. In my nutch-site.xml file I have included the "epub" format in the plugin.includes property, in regex-urlfilter.txt I accept everything with the pattern ".+", and I have included the parse-tika plugin in parse-plugins.xml.
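As an illustration of the setup described above, the relevant entries might look roughly like the following. This is only a sketch assuming a Nutch 1.x layout: the plugin list is an assumed example value, not the exact contents of the file, and plugin.includes is a regular expression over plugin IDs, so the part that matters here is that it matches parse-tika.

```xml
<!-- nutch-site.xml (fragment). Assumed example value, not the poster's exact file. -->
<property>
  <name>plugin.includes</name>
  <!-- The regex must match the parse-tika plugin ID for Tika-based parsing to be loaded. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

In parse-plugins.xml, a mapping from the EPUB MIME type to parse-tika would presumably take this form (application/epub+zip is the standard EPUB media type; that this exact entry is what the setup needs is an assumption):

```xml
<!-- parse-plugins.xml (fragment). Hypothetical mapping for EPUB documents. -->
<mimeType name="application/epub+zip">
  <plugin id="parse-tika" />
</mimeType>
```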
I have put this URL in seed.txt in the url directory:

    www.epingsoft.com/epub/examples/AChristmasCarol.epub

I am using the following commands to get the contents:

    bin/nutch crawl urls -solr http://localhost:8983/solr/
    bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

But when I try to view the result in Solr, it displays only the URL of the file, as follows:

    www.epingsoft.com/epub/examples/AChristmasCarol.epub/AChristmasCarol AChristmasCarol AChristmasCarol www.epingsoft.com/epub/examples/AChristmasCarol.epub AChristmasCarol www.epingsoft.com/epub/examples/AChristmasCarol.epub

Please help: how can I get the actual contents of the EPUB file?

--
View this message in context: http://lucene.472066.n3.nabble.com/how-to-parse-epub-files-using-plugin-parse-tika-tp4063137.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

