Hello,

I have a few improvements to Nutch, and I would like feedback on whether this community thinks I should submit them to the main branch. Once my first PR is approved, I can start adding these. Some of them might not be good ideas, so I am happy to hear that feedback as well.
1. json-indexer: indexes documents in JSON Lines format.
2. Selenium extracts the html tag instead of the body tag (sample commit <https://github.com/Elio-Earth/nutch/commit/0e2aece37e3a8908221a50ee5803b8d3edd5a33e>): I needed this to extract the title of the page, since that often lives in the head tag. I am hesitant about this change because it could have bigger effects.
3. Add the ability to extract meta tags with a "property" attribute (sample commit <https://github.com/Elio-Earth/nutch/commit/fa23f28f66a00f9048e72a0481c29ba01fc9af56>).
4. Allow Selenium to handle gzip content (sample commit <https://github.com/Elio-Earth/nutch/commit/611387097a2f38e345e397bf32d515cea22729fc>): This is a port of the code from HTMLUnit that does the same thing. I needed this to process RSS feeds properly.
5. Treat RSS feeds as normal webpages by adding links to the next segment fetch (sample commit <https://github.com/Elio-Earth/nutch/commit/1f8e4a9ee11de3c1a95926dbe279c0258d4c4438>).

Kamil