RE: Getting Whole HTML?

Markus Jelsma Tue, 05 Jul 2016 15:05:25 -0700

Trevor - I am unfamiliar with 2.x, but it should be possible to get it out of 
HBase with ease, right? Nutch 2.x stores it in the webtable. Nutch 1.x allows 
you to index raw HTML as-is, so you can fetch it from Solr or Elasticsearch.
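
A rough sketch of reading the raw content straight from HBase, in case it 
helps. Everything specific here is an assumption to check against your own 
gora-hbase-mapping.xml and setup: that the table is named "webpage", that the 
content lives in column "f:cnt", that row keys are reversed URLs (host parts 
reversed, then protocol, optional port, and path), that the HBase Thrift 
server is running, and that the third-party happybase package is installed. 
The helper names are made up for illustration.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a reversed-URL row key, e.g.
    http://www.example.com/index.html -> com.example.www:http/index.html
    (assumed to match the key layout Nutch 2.x uses in the webtable)."""
    parts = urlsplit(url)
    # Reverse the host components: www.example.com -> com.example.www
    key = ".".join(reversed(parts.hostname.split("."))) + ":" + parts.scheme
    # Only an explicit port is appended
    if parts.port is not None:
        key += ":" + str(parts.port)
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    if path and not path.startswith("/"):
        path = "/" + path
    return key + path


def fetch_raw_html(url, host="localhost", table_name="webpage", column=b"f:cnt"):
    """Return the stored raw content bytes for url, or None if absent.
    Assumes an HBase Thrift server on `host`; table and column are guesses."""
    import happybase  # third-party; imported lazily so reverse_url works alone

    connection = happybase.Connection(host, compat="0.94")
    try:
        row = connection.table(table_name).row(
            reverse_url(url).encode("utf-8"), columns=[column]
        )
        return row.get(column)
    finally:
        connection.close()
```

Calling fetch_raw_html("http://www.example.com/index.html") would then look up 
the row "com.example.www:http/index.html" and hand back the bytes as stored, 
which you can decode with the page's character encoding.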


Markus

 
 
-----Original message-----
> From:Trevor Oakley <[email protected]>
> Sent: Monday 4th July 2016 19:42
> To: [email protected]
> Subject: Getting Whole HTML?
> 
> We are using Nutch 2.3 to extract data into Elasticsearch (1.7) and using 
> HBase 0.94.27. The system all works fine for text, and the HTML is stored in 
> HBase, but we cannot extract it.
> We tried dump and a few other options, but nothing has worked so far.
> Has anyone any ideas?
