Re: Extracting html pages from db

LoneEagle70 Wed, 17 Oct 2007 10:21:01 -0700

I do not want to do it through the web app.

Is there a way to extract all the HTML files into a directory from the
command line, the same way the stats can be displayed? I tried the dump, but
it was not what I wanted. I really want only the HTML pages, so I can pull
information out of them.
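
To make the goal concrete, here is a rough Python sketch of the result I am
after: one HTML string per URL, cut out of a text dump. It assumes the dump
starts each record with a `Recno::` line, gives the address on a `URL::` line,
and puts the raw page after a line reading just `Content:`, with other
sections introduced by lines such as `ParseData::`. Those markers and the
name `split_dump` are my assumptions for illustration; I have not confirmed
that this is exactly what Nutch writes.

```python
import re


def split_dump(dump_text):
    """Map each URL to the raw page content found in its record.

    The record layout (Recno::, URL::, Content:) is an assumed format,
    not confirmed Nutch output.
    """
    pages = {}
    # Assumed: every record begins with a line like "Recno:: 12".
    for record in re.split(r"(?m)^Recno:: \d+\s*$", dump_text):
        url_match = re.search(r"(?m)^URL:: (\S+)\s*$", record)
        # Assumed: the page follows a line that is only "Content:" and runs
        # until the next "Something::" section header or the record's end.
        body_match = re.search(
            r"(?ms)^Content:\s*\n(.*?)(?=^\w+::\s*$|\Z)", record
        )
        if url_match and body_match:
            pages[url_match.group(1)] = body_match.group(1).strip()
    return pages
```

Each value could then be written to its own file in a directory, which is
the kind of output I was hoping to get directly from the command line.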

Here is my problem: we are looking for a program that will do web crawling,
but it must be customized for each site we need, because those pages are
generated from parameters. We also need to extract information from the pages
(product, price, manufacturer, ...). So, if you have experience with Nutch,
you could help me out. Can I customize it through hooks? What can and can't
I do?
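
To give an idea of the per-site extraction we have in mind, here is a minimal
Python sketch that runs on a page after it has been pulled out of the crawl.
It is plain post-processing with the standard library, not a Nutch hook or
plugin. The class names in the rules (`prod-name`, `price`, `maker`) and the
helper `extract_fields` are hypothetical; each site would need its own
mapping.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect the text of elements whose class matches a per-site rule."""

    def __init__(self, class_to_field):
        super().__init__()
        # Hypothetical per-site rules, e.g. {"prod-name": "product"}.
        self.class_to_field = class_to_field
        self.fields = {}
        self._current = None  # name of the field being captured, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in self.class_to_field:
                self._current = self.class_to_field[cls]
                self.fields[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: stop capturing at the first closing tag, so
        # nested markup inside a field is only partly captured.
        if self._current is not None:
            self.fields[self._current] = self.fields[self._current].strip()
            self._current = None


def extract_fields(html, class_to_field):
    """Return a dict of field name -> text for one page."""
    parser = FieldExtractor(class_to_field)
    parser.feed(html)
    parser.close()
    return parser.fields
```

The point of keeping the rules in a dictionary is that only the mapping would
change from one site to the next, not the code.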

Thanks for your help! :)

Dennis Kubes-2 wrote:
> 
> It depends on what you are trying to do.  Content in segments stores the 
> full content (html, etc.) of each page.  The cached.jsp page displays 
> full content.
> 
> Dennis Kubes
> 
> 
> LoneEagle70 wrote:
>> Hi,
>> 
>> I was able to install Nutch 0.9, crawl a site, and use the web page to do
>> a full-text search of my db.
>> 
>> But we need to extract information from every HTML page.
>> 
>> So, is there a way to extract HTML pages from the db?
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Extracting-html-pages-from-db-tf4640373.html#a13258493
Sent from the Nutch - User mailing list archive at Nabble.com.
