Trevor - I am unfamiliar with 2.x, but it should be possible to get it out of HBase with ease, right? Nutch stores it in the webtable. Nutch 1.x allows you to index the raw HTML as-is, so you can fetch it from Solr or Elasticsearch.
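A rough sketch of reading the content straight from HBase, in Python. Everything specific here is an assumption to check against your own setup: the table name "webpage" and the column "f:cnt" come from the default gora-hbase-mapping.xml, the row key is assumed to be the URL with its host reversed (as Nutch's TableUtil does), and happybase is a third-party client that needs the HBase Thrift server running. The function names are made up for illustration, and URLs with an explicit port are deliberately not handled.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a Nutch 2.x style row key: reversed host, then scheme, then path.

    Example: "http://www.example.com/a" -> "com.example.www:http/a".
    Explicit ports are rejected because this sketch does not cover them.
    """
    parts = urlsplit(url)
    if not parts.scheme or not parts.hostname:
        raise ValueError("expected an absolute URL, got %r" % url)
    if parts.port is not None:
        raise ValueError("URLs with an explicit port are not handled here")
    key = ".".join(reversed(parts.hostname.split("."))) + ":" + parts.scheme
    key += parts.path
    if parts.query:
        key += "?" + parts.query
    return key


def fetch_raw_html(url, host="localhost", table_name="webpage"):
    """Return the stored page content as bytes, or None if the row is missing.

    Assumes the default mapping (table "webpage", family "f", qualifier "cnt").
    If you crawl with a crawl id, the table name is usually prefixed with it.
    """
    import happybase  # third-party; imported lazily so reverse_url works alone

    connection = happybase.Connection(host, compat="0.94")
    try:
        table = connection.table(table_name)
        row = table.row(reverse_url(url).encode("utf-8"), columns=[b"f:cnt"])
        return row.get(b"f:cnt")
    finally:
        connection.close()
```

Calling fetch_raw_html("http://www.example.com/") would look up the row "com.example.www:http/" and hand back the raw bytes, which you can then decode or write to disk.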
Markus

-----Original message-----
> From: Trevor Oakley <[email protected]>
> Sent: Monday 4th July 2016 19:42
> To: [email protected]
> Subject: Getting Whole HTML?
>
> We are using nutch 2.3 to extract data into elasticsearch (1.7) and using
> hbase 0.94.27. The system all works fine for text and the html is stored in
> hbase but we cannot extract it.
> We tried dump and a few options but nothing worked so far.
> Has anyone any ideas?

