Hi Kamil,
> 1. json-indexer: indexes documents in json lines format
Sounds good. There's already an indexer-csv (works only in local mode).
> 2. selenium extracts the html tag vs the body tag
Definitely makes sense.
> I am hesitant about this change because it could have bigger effects.
If in doubt, we could add a new method and make it configurable whether the
inner or outer HTML is returned.
> 3. Add ability to extract meta tags with "property" attribute
+1
(should be applied also to parse-tika - there's some duplicated code between
parse-html and parse-tika)
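To illustrate item 3, here is a minimal sketch of reading both the "name" and
the "property" attribute of meta tags from a DOM tree. It is not the code from
parse-html or parse-tika: the class MetaTagSketch and the methods extractMeta
and parseXhtml are hypothetical names, and the helper only parses well-formed
XHTML so that the example runs with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not the actual Nutch plugin code.
class MetaTagSketch {

    // Collects meta tags keyed by "name", falling back to "property"
    // (as used by Open Graph tags such as og:title).
    static Map<String, String> extractMeta(Document doc) {
        Map<String, String> result = new LinkedHashMap<>();
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            // getAttribute returns "" when the attribute is absent.
            String key = meta.getAttribute("name");
            if (key.isEmpty()) {
                key = meta.getAttribute("property");
            }
            String content = meta.getAttribute("content");
            if (!key.isEmpty() && !content.isEmpty()) {
                result.put(key, content);
            }
        }
        return result;
    }

    // Helper for the example only: parses a well-formed XHTML string.
    static Document parseXhtml(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xhtml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Keeping the lookup in one shared helper is one way to avoid maintaining the
same logic twice in two plugins.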
> 4. Allow selenium to handle gzip content
Definitely. However, I wonder whether this couldn't be delegated to one of the
other HTTP protocol plugins. But I agree this might be tricky:
- calling a plugin from another one
- maybe better in the way discussed in the user list:
send a HEAD request and decide which way to go based on the Content-Type
response header
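The HEAD-request idea could be sketched roughly as follows. This is an
assumption about how such a switch might look, not existing Nutch code: the
class FetchRouteSketch, the Route enum and both methods are hypothetical, and
the rule "HTML goes to the browser, everything else to a plain HTTP plugin" is
only one possible policy. It uses the java.net.http client from Java 11.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;

// Hypothetical sketch, not part of any Nutch protocol plugin.
class FetchRouteSketch {

    enum Route { BROWSER, PLAIN_HTTP }

    // Decides the route from a Content-Type header value.
    // Assumption: only HTML pages need browser rendering; a missing
    // header falls back to the plain HTTP route.
    static Route routeFor(String contentType) {
        if (contentType == null) {
            return Route.PLAIN_HTTP;
        }
        // Drop parameters such as "; charset=UTF-8".
        String mediaType = contentType.split(";", 2)[0]
                .trim().toLowerCase(Locale.ROOT);
        if (mediaType.equals("text/html")
                || mediaType.equals("application/xhtml+xml")) {
            return Route.BROWSER;
        }
        return Route.PLAIN_HTTP;
    }

    // Sends a HEAD request and routes on the Content-Type response header.
    static Route probe(HttpClient client, String url) throws Exception {
        HttpRequest head = HttpRequest.newBuilder(URI.create(url))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> response =
                client.send(head, HttpResponse.BodyHandlers.discarding());
        return routeFor(
                response.headers().firstValue("Content-Type").orElse(null));
    }
}
```

A feed served as application/rss+xml would then take the plain HTTP route,
where gzip handling is already someone else's job. The cost is one extra
request per URL, and some servers answer HEAD differently from GET.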
> 5. Treat RSS feeds as normal webpages by adding links to next segment fetch
This is actually done if you let parse-tika parse the feeds. The "feed" plugin
is very special in this respect. It takes every feed item as one document and
forwards it to the index. This is a different use case. I'm open to
discussion, however.
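For comparison, the "feed as a normal webpage" use case boils down to turning
each item's link into an outlink. Here is a minimal sketch under stated
assumptions: it handles plain RSS 2.0 only, uses the JDK's DOM parser rather
than Tika, and the class FeedOutlinkSketch and the method itemLinks are
hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not the parse-tika or feed plugin code.
class FeedOutlinkSketch {

    // Returns the <link> URL of every <item>; the channel-level
    // <link> is skipped because it is not inside an <item>.
    static List<String> itemLinks(String rssXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Feeds are untrusted input, so refuse DOCTYPE declarations.
        factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(
                        rssXml.getBytes(StandardCharsets.UTF_8)));

        List<String> links = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList linkNodes = item.getElementsByTagName("link");
            if (linkNodes.getLength() > 0) {
                String url = linkNodes.item(0).getTextContent().trim();
                if (!url.isEmpty()) {
                    links.add(url);
                }
            }
        }
        return links;
    }
}
```

The returned URLs would be scheduled for fetching like any other outlinks,
whereas the "feed" plugin indexes each item directly.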
Thanks for your contributions!
Best,
Sebastian
On 1/20/23 16:28, Kamil Mroczek wrote:
Hello,
I have a few improvements to Nutch and would like feedback on whether this
community thinks I should submit them to the main branch. Once my first PR is
approved, I can start adding these. Some of them might not be good ideas, so I
am happy to hear that feedback as well.
1. json-indexer: indexes documents in json lines format
2. selenium extracts the html tag vs the body tag (sample commit
<https://github.com/Elio-Earth/nutch/commit/0e2aece37e3a8908221a50ee5803b8d3edd5a33e>): I needed this to extract the title of the page since that often lives in the head tag. I am hesitant about this change because it could have bigger effects.
3. Add ability to extract meta tags with "property" attribute (sample commit
<https://github.com/Elio-Earth/nutch/commit/fa23f28f66a00f9048e72a0481c29ba01fc9af56>).
4. Allow selenium to handle gzip content (sample commit
<https://github.com/Elio-Earth/nutch/commit/611387097a2f38e345e397bf32d515cea22729fc>): This is a port of the code from HTMLUnit that does the same thing. I needed this to process RSS feeds properly.
5. Treat RSS feeds as normal webpages by adding links to next segment fetch
(sample commit
<https://github.com/Elio-Earth/nutch/commit/1f8e4a9ee11de3c1a95926dbe279c0258d4c4438>)
Kamil