I'm attempting to crawl pages with charset UTF-16 and send the index to Solr, where it can be searched. I followed the instructions at http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ and successfully crawled and searched test content encoded as UTF-8. However, when I crawl the UTF-16 content, it is sent to Solr as Japanese characters. The pages encoded as UTF-16 contain only English text, no special characters. Is there any way to force Nutch to crawl the pages as UTF-8 and ignore the UTF-16 setting?
Thanks.
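One possible explanation for the symptom, offered as an assumption rather than a diagnosis: if the pages declare UTF-16 but their bytes are actually single-byte (ASCII or UTF-8), a decoder that trusts the declaration pairs every two bytes into one 16-bit code unit, and pairs of lowercase Latin letters land in the CJK ideograph range. The sketch below reproduces that effect outside Nutch. The sample string, the helper `in_cjk_range`, and the choice of big-endian byte order are all hypothetical and only serve the illustration.

```python
# Hypothetical sample: ASCII-only English text with an even number of bytes.
text = "english text"

# ASCII-only text has the same bytes in ASCII and UTF-8.
raw = text.encode("utf-8")

# Misreading those bytes as UTF-16 (big-endian assumed here) fuses each
# pair of bytes into a single 16-bit code unit.
misread = raw.decode("utf-16-be")


def in_cjk_range(ch):
    """True if ch is in CJK Extension A or CJK Unified Ideographs."""
    return 0x3400 <= ord(ch) <= 0x9FFF


# Decoding with the charset the bytes were really written in restores the text.
correct = raw.decode("utf-8")

print(repr(misread))
print([hex(ord(ch)) for ch in misread])
print(all(in_cjk_range(ch) for ch in misread))
print(correct == text)
```

If this is what is happening, comparing the raw bytes of a fetched page against its declared charset would confirm it: genuine UTF-16 English text has a zero byte next to every letter, while mislabeled UTF-8 or ASCII text has none.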