Do you have any idea how to extract, from the command line, all my HTML files
stored in the db?
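One option (a sketch, assuming the stock Nutch 0.9 command-line tools; the paths below are placeholders, not real segment names) is the segment reader's dump mode restricted to the Content part, which holds the raw fetched HTML:

```shell
# Dump only the raw Content (which includes the fetched HTML) of one
# segment, suppressing the other segment parts.  Replace the placeholder
# paths with your own crawl directory and segment name.
bin/nutch readseg -dump crawl/segments/<segment> html_dump \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
# The result is a plain-text file html_dump/dump with one Content
# record (protocol headers plus raw HTML) per fetched URL.
```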

Dennis Kubes-2 wrote:
> 
> Pulling out specific information for each site could be done through 
> HtmlParseFilter implementations.  Look at 
> org.apache.nutch.parse.HtmlParseFilter and its implementations.  The 
> specific fields you extract can be stored in MetaData in ParseData.  You 
> can then access that information in other jobs such as indexer.  Hope 
> this helps.
> 
> Dennis Kubes
> 
> LoneEagle70 wrote:
>> I do not want to use the WebApp.
>> 
>> Is there a way to extract all the HTML files from the command line into a
>> directory? Something like displaying stats. I tried the dump, but it was
>> not what I wanted; I really want only the HTML pages, so I can take
>> information from them.
>> 
>> Here is my problem: we are looking for a program that will do web
>> crawling, but it must be customized for each site we target, because
>> those pages are generated based on parameters. We also need to extract
>> information (product, price, manufacturer, ...). So, if you have
>> experience with Nutch, you could help me out. Can I customize it through
>> hooks? What can and can't I do?
>> 
>> Thanks for your help! :)
>> 
>> Dennis Kubes-2 wrote:
>>> It depends on what you are trying to do.  Content in segments stores the 
>>> full content (html, etc.) of each page.  The cached.jsp page displays 
>>> full content.
>>>
>>> Dennis Kubes
>>>
>>>
>>> LoneEagle70 wrote:
>>>> Hi,
>>>>
>>>> I was able to install Nutch 0.9, crawl a site, and use the web page to
>>>> do full-text search of my db.
>>>>
>>>> But we need to extract information from all the HTML pages.
>>>>
>>>> So, is there a way to extract the HTML pages from the db?
>>>
>> 
> 
> 
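For the HtmlParseFilter route Dennis describes, the site-specific extraction core might look roughly like the self-contained sketch below. The markup patterns, class names, and field names are illustrative only, not part of Nutch; a real filter would implement org.apache.nutch.parse.HtmlParseFilter, run logic like this over the parsed page, and store the matches in the ParseData metadata so the indexer can pick them up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the per-site field extraction one might embed in a custom
// HtmlParseFilter.  The HTML markup assumed here is hypothetical; each
// target site would need its own patterns, since the pages are
// parameter-generated and differ per site.
public class ProductExtractor {

    // Assumed markup: <h1 class="product">...</h1> and
    // <span class="price">...</span>
    public static final Pattern PRODUCT =
        Pattern.compile("<h1 class=\"product\">([^<]+)</h1>");
    public static final Pattern PRICE =
        Pattern.compile("<span class=\"price\">([^<]+)</span>");

    // Return the first capture-group match in the page, or null.
    public static String first(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<h1 class=\"product\">Widget</h1>"
                    + "<span class=\"price\">19.99</span>";
        System.out.println(first(PRODUCT, html)); // Widget
        System.out.println(first(PRICE, html));   // 19.99
    }
}
```

In a real filter, the extracted values would go into the parse metadata (e.g. via the ParseData's metadata object) under whatever field names your indexing job expects.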

-- 
View this message in context: 
http://www.nabble.com/Extracting-html-pages-from-db-tf4640373.html#a13258870
Sent from the Nutch - User mailing list archive at Nabble.com.
