Do you have any idea how to extract, from the command line, all my HTML files stored in the db?
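As Dennis notes further down the thread, the segments store the full content of each page, so one way to get at the HTML is to read a segment's content MapFile directly from a small Java program. Below is a minimal, untested sketch assuming the Nutch 0.9 / Hadoop APIs of that era; the path argument and the output file names are illustrative:

    import java.io.FileOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;

    // Dumps the raw HTML stored under a segment's content/ directory
    // to local files, one file per page.
    public class HtmlDumper {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // e.g. crawl/segments/<segment>/content/part-00000/data
        Path data = new Path(args[0]);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        Content content = new Content();
        int n = 0;
        while (reader.next(url, content)) {
          String type = content.getContentType();
          // Keep only HTML pages; skip images, PDFs, etc.
          if (type != null && type.startsWith("text/html")) {
            FileOutputStream out =
                new FileOutputStream("page-" + (n++) + ".html");
            out.write(content.getContent());
            out.close();
          }
        }
        reader.close();
      }
    }

A MapFile's data file is itself a SequenceFile, which is why the reader above can be pointed straight at a segment's content/part-00000/data file.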
Dennis Kubes-2 wrote:
>
> Pulling out specific information for each site could be done through
> HtmlParseFilter implementations. Look at
> org.apache.nutch.parse.HtmlParseFilter and its implementations. The
> specific fields you extract can be stored in MetaData in ParseData. You
> can then access that information in other jobs such as the indexer. Hope
> this helps.
>
> Dennis Kubes
>
> LoneEagle70 wrote:
>> I do not want to do it using the WebApp.
>>
>> Is there a way to extract all the HTML files from the command line into
>> a directory? Something like displaying stats. I tried the dump, but it
>> was not what I wanted. I really want only the HTML pages, so I can take
>> information from them.
>>
>> Here is my problem: we are looking for a program that will do web
>> crawling, but it must be customized for each site we need, because
>> those pages are generated based on parameters. Also, we need to extract
>> information (product, price, manufacturer, ...). So, if you have
>> experience with Nutch, you could help me out. Can I customize it
>> through hooks? What can/can't I do?
>>
>> Thanks for your help! :)
>>
>> Dennis Kubes-2 wrote:
>>> It depends on what you are trying to do. Content in segments stores
>>> the full content (html, etc.) of each page. The cached.jsp page
>>> displays the full content.
>>>
>>> Dennis Kubes
>>>
>>> LoneEagle70 wrote:
>>>> Hi,
>>>>
>>>> I was able to install Nutch 0.9, crawl a site, and use the web page
>>>> to do a full-text search of my db.
>>>>
>>>> But we need to extract information from all the HTML pages.
>>>>
>>>> So, is there a way to extract the HTML pages from the db?
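Following up on the HtmlParseFilter suggestion above, a skeleton of such a filter might look like the sketch below. This is only an illustration, not a definitive implementation: the "Price: $" pattern and the "product.price" metadata key are made-up examples, and a real filter would walk the DocumentFragment for site-specific markup.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    // Sketch of a per-site field extractor registered at the
    // HtmlParseFilter extension point.
    public class PriceExtractorFilter implements HtmlParseFilter {
      private Configuration conf;

      public Parse filter(Content content, Parse parse,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
        // Hypothetical extraction: scan the parse text for a price
        // pattern. A real filter would inspect the DOM instead.
        String text = parse.getText();
        int i = text.indexOf("Price: $");
        if (i >= 0) {
          String price = text.substring(i + 8).split("\\s+")[0];
          // Stash the field in the parse metadata so a later job,
          // such as the indexer, can pick it up.
          parse.getData().getParseMeta().set("product.price", price);
        }
        return parse;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }

The class would also need a plugin.xml entry registering it at the org.apache.nutch.parse.HtmlParseFilter extension point, and the plugin has to be activated through the plugin.includes property in nutch-site.xml.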
