Hi

Could you post more details from the logs?  Maybe you can use this command to
check the parser first. [0]

bin/nutch plugin Parser org.apache.nutch.parse.ParserChecker
www.epingsoft.com/epub/examples/AChristmasCarol.epub
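
In case it helps, here is a minimal sketch of what the two config entries
usually look like. Note that plugin.includes takes a regular expression over
plugin directory names, not file formats, so "epub" itself would not go there.
The plugin list below and the application/epub+zip mapping are assumptions on
my side, so adjust them to your setup.

```xml
<!-- nutch-site.xml: plugin.includes is a regex over plugin names.
     The list below is an assumed example; parse-tika must match it. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- parse-plugins.xml: route the epub MIME type to parse-tika
     (assumed mapping; a "*" catch-all entry may already cover it). -->
<mimeType name="application/epub+zip">
  <plugin id="parse-tika" />
</mimeType>
```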

[0] http://wiki.apache.org/nutch/bin/nutch%20plugin


On Tue, May 14, 2013 at 1:14 PM, mahodaya <[email protected]> wrote:

> Hi
>
> My requirement is to extract the contents of epub files using Apache Nutch
> and Solr. In my nutch-site.xml file I have included the "epub" format in the
> plugin.includes property, in regex-urlfilter.txt I accept everything with
> the pattern ".+", and I have included the parse-tika plugin in
> parse-plugins.xml.
>
> I am giving this URL, www.epingsoft.com/epub/examples/AChristmasCarol.epub, in
> seed.txt of the url directory.
>
> I am using the following commands to get the contents:
>
> bin/nutch crawl urls -solr http://localhost:8983/solr/
>
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb
> crawl/linkdb crawl/segments/*
>
> But when I try to view the result in Solr, it displays only the URL of the
> file, as follows:
>
> www.epingsoft.com/epub/examples/AChristmasCarol.epub/AChristmasCarol
> AChristmasCarol AChristmasCarol
> www.epingsoft.com/epub/examples/AChristmasCarol.epub AChristmasCarol
> www.epingsoft.com/epub/examples/AChristmasCarol.epub
>
>
> Please help: how can I get the actual contents of the epub file?
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/how-to-parse-epub-files-using-plugin-parse-tika-tp4063137.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>



-- 
Don't Grow Old, Grow Up... :-)
