Hello,

I have a few improvements to Nutch and would like feedback on whether this
community thinks I should submit them to the main branch. Once my first PR
is approved I can start adding these. Some of them might not be good ideas,
so I am happy to hear that feedback as well.

1. json-indexer: indexes documents in JSON Lines format (one JSON object
per line)

2. Have selenium extract the html tag instead of the body tag (sample commit
<https://github.com/Elio-Earth/nutch/commit/0e2aece37e3a8908221a50ee5803b8d3edd5a33e>):
I needed this to extract the title of the page, which normally lives in the
head tag. I am hesitant about this change because it could have wider side
effects.

3. Add the ability to extract meta tags with a "property" attribute (sample
commit
<https://github.com/Elio-Earth/nutch/commit/fa23f28f66a00f9048e72a0481c29ba01fc9af56>).

4. Allow selenium to handle gzip-compressed content (sample commit
<https://github.com/Elio-Earth/nutch/commit/611387097a2f38e345e397bf32d515cea22729fc>):
This is a port of the code from HTMLUnit that does the same thing. I needed
this to process RSS feeds properly.

5. Treat RSS feeds as normal webpages by adding their links to the next
segment's fetch (sample commit
<https://github.com/Elio-Earth/nutch/commit/1f8e4a9ee11de3c1a95926dbe279c0258d4c4438>).
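
As a rough illustration of the output format meant in item 1, here is a
minimal sketch of writing documents as JSON Lines: one JSON object per
document, one document per line, with newlines inside values escaped so each
record stays on a single line. The class and method names (JsonLinesSketch,
toLine, toLines) are made up for this example and the fields are assumed to
be plain strings; this is not the json-indexer plugin's code, and a real
indexer would more likely use a JSON library than hand-rolled escaping.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of Nutch.
final class JsonLinesSketch {

  // Escapes the characters that JSON requires to be escaped inside a string.
  static String escape(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '"':  sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            // Remaining control characters become a four-digit hex escape.
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.toString();
  }

  // Serializes one document (field name -> string value) as a single JSON object.
  static String toLine(Map<String, String> doc) {
    StringJoiner fields = new StringJoiner(",", "{", "}");
    for (Map.Entry<String, String> e : doc.entrySet()) {
      fields.add("\"" + escape(e.getKey()) + "\":\"" + escape(e.getValue()) + "\"");
    }
    return fields.toString();
  }

  // Writes each document on its own line, terminated by a newline.
  static String toLines(List<Map<String, String>> docs) {
    StringBuilder sb = new StringBuilder();
    for (Map<String, String> doc : docs) {
      sb.append(toLine(doc)).append('\n');
    }
    return sb.toString();
  }
}
```

Because every record is a complete JSON object on its own line, the output
can be appended to, split, and streamed without parsing the whole file.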

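For item 4, the general idea of handling gzipped content can be sketched as
follows: check whether the fetched bytes start with the gzip magic number
(0x1f 0x8b) and, if so, decompress them before parsing. This uses only
java.util.zip from the JDK. The class and method names (GzipSketch,
looksGzipped, maybeGunzip) are hypothetical, and this is not the code from
the linked commit or from HTMLUnit, which may well decide based on the
Content-Encoding header instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not part of Nutch.
final class GzipSketch {

  // True if the buffer starts with the gzip magic bytes 0x1f 0x8b.
  static boolean looksGzipped(byte[] data) {
    return data != null && data.length >= 2
        && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
  }

  // Returns the decompressed bytes if the input looks gzipped,
  // otherwise returns the input array unchanged.
  static byte[] maybeGunzip(byte[] data) {
    if (!looksGzipped(data)) {
      return data;
    }
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
      return in.readAllBytes(); // requires Java 9 or later
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Compresses bytes with gzip; only here so the sketch can be tried out.
  static byte[] gzip(byte[] data) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(data);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }
}
```

Checking the magic bytes rather than trusting headers alone also covers
servers that send gzipped feeds without declaring the encoding, at the cost
of one extra two-byte comparison per response.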

Kamil
