There is currently no way to do that. You would need to write a map job
to pull the data from Content within Segments.
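Something along these lines should work as a starting point. This is an untested sketch, assuming the old-style Hadoop mapred API bundled with Nutch 0.9 and Text URL keys in the segment; the class name, paths, and content-type check are placeholders:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.nutch.protocol.Content;

// Map-only job that dumps the raw HTML stored in a segment's content
// directory out to text files.
public class HtmlDumper implements Mapper {

  public void configure(JobConf job) {}
  public void close() {}

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    Content content = (Content) value;
    // Keep only pages that were fetched as HTML.
    if (content.getContentType() != null
        && content.getContentType().startsWith("text/html")) {
      // Ignores charset; good enough for a quick dump.
      output.collect(key, new Text(new String(content.getContent())));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(HtmlDumper.class);
    job.setJobName("dump-html");
    // args[0] = segment dir, e.g. crawl/segments/20080101000000
    job.setInputPath(new Path(args[0], Content.DIR_NAME));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(HtmlDumper.class);
    job.setNumReduceTasks(0);
    job.setOutputPath(new Path(args[1]));
    job.setOutputFormat(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    JobClient.runJob(job);
  }
}

With zero reduce tasks the map output goes straight to the output directory, one file per map task.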
Dennis Kubes
LoneEagle70 wrote:
Do you have any idea how to extract all my HTML files stored in the db
from the command line?
Dennis Kubes-2 wrote:
Pulling out specific information for each site could be done through
HtmlParseFilter implementations. Look at
org.apache.nutch.parse.HtmlParseFilter and its implementations. The
specific fields you extract can be stored in MetaData in ParseData. You
can then access that information in other jobs such as indexer. Hope
this helps.
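For example, a filter that grabs a price from the page and stashes it in the parse metadata might look roughly like the following. This is hypothetical and untested; the <span class="price"> markup and the "price" field name are made up for illustration, since the real extraction logic depends on each site's HTML:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class PriceExtractorFilter implements HtmlParseFilter {

  private Configuration conf;

  public Parse filter(Content content, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    String price = findPrice(doc);
    if (price != null) {
      // Store the extracted field so later jobs (e.g. the indexer)
      // can read it from ParseData.
      parse.getData().getParseMeta().set("price", price);
    }
    return parse;
  }

  // Walk the parsed DOM looking for <span class="price">; illustrative only.
  private String findPrice(Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Element e = (Element) node;
      if ("span".equalsIgnoreCase(e.getTagName())
          && "price".equals(e.getAttribute("class"))) {
        return e.getTextContent();
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      String found = findPrice(children.item(i));
      if (found != null) return found;
    }
    return null;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

You would also need to package this as a plugin, with a plugin.xml declaring an extension of the org.apache.nutch.parse.HtmlParseFilter extension point, and add it to plugin.includes in nutch-site.xml so the HTML parser picks it up.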
Dennis Kubes
LoneEagle70 wrote:
I do not want to use the WebApp.
Is there a way to extract all the HTML files from the command line into a
directory? Something like the stats display. I tried the dump, but it was
not what I wanted. I really want only the HTML pages so I can take
information from them.
Here is my problem: we are looking for a program that will do web
crawling, but it must be customized for each site we need, because those
pages are generated based on parameters. We also need to extract
information (product, price, manufacturer, ...). So, if you have
experience with Nutch, you could help me out. Can I customize it through
hooks? What can and can't I do?
Thanks for your help! :)
Dennis Kubes-2 wrote:
It depends on what you are trying to do. Content in segments stores the
full content (html, etc.) of each page. The cached.jsp page displays
full content.
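If you just want to poke at a segment without writing a job, the content directory is a MapFile whose data file is a plain SequenceFile, so something like this untested sketch will iterate the stored pages (the class name is a placeholder and the path is passed in as an argument):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

// Reads one part of a segment's content directory and prints the URL and
// content type of each stored page.
public class ContentPeek {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // e.g. crawl/segments/20080101000000/content/part-00000/data
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text url = new Text();
    Content content = new Content();
    while (reader.next(url, content)) {
      System.out.println(url + "\t" + content.getContentType());
    }
    reader.close();
  }
}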
Dennis Kubes
LoneEagle70 wrote:
Hi,
I was able to install Nutch 0.9 and crawl a site and use the Web Page
to
do
full text search of my db.
But we need to extract informations from all HTML page.
So, is there a way to extract HTML pages from the db?