Hi,

My requirement is to extract the contents of EPUB files using Apache Nutch
and Solr. In my nutch-site.xml file I have included the "epub" format in the
plugin.includes property, in regex-urlfilter.txt I accept everything with
the pattern ".+", and I have included the parse-tika plugin in
parse-plugins.xml.
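
To make the plugin.includes part concrete, here is a minimal sketch of how I
understand that property to work: as far as I know, Nutch treats its value as
a regular expression that must match a whole plugin id (such as parse-tika).
The helper name plugin_enabled and the sample value are hypothetical, not
copied from my real configuration, and grep's extended regex syntax is only
an approximation of the Java regex that Nutch itself uses.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): report whether a plugin id
# is matched in full by a plugin.includes-style regular expression.
#   $1 = value of the plugin.includes property
#   $2 = plugin id to check, e.g. parse-tika
plugin_enabled() {
    # -E: extended regex, -x: the whole line must match, -q: no output
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Illustrative value only; an assumption, not my actual nutch-site.xml.
includes='protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)'

if plugin_enabled "$includes" parse-tika; then
    echo "parse-tika is matched by plugin.includes"
else
    echo "parse-tika is NOT matched by plugin.includes"
fi
```

If this understanding is right, a value that only names a file format would
not enable any parser by itself, because the expression is compared against
plugin ids rather than file extensions.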

I have put this URL, www.epingsoft.com/epub/examples/AChristmasCarol.epub, in
seed.txt in the urls directory.

I am using the following commands to get the contents:

bin/nutch crawl urls -solr http://localhost:8983/solr/  

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb \
    crawl/linkdb crawl/segments/*

But when I try to view the result in Solr, it displays only the URL of the
file, as follows:

www.epingsoft.com/epub/examples/AChristmasCarol.epub/AChristmasCarol
AChristmasCarol AChristmasCarol 
www.epingsoft.com/epub/examples/AChristmasCarol.epub AChristmasCarol
www.epingsoft.com/epub/examples/AChristmasCarol.epub


Please help: how can I get the actual contents of the EPUB file?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-parse-epub-files-using-plugin-parse-tika-tp4063137.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
