Hi Kamil,

> 1. json-indexer: indexes documents in json lines format

Sounds good. There's already an indexer-csv (works only in local mode).

> 2. selenium extracts the html tag vs the body tag

Definitely makes sense.

> I am hesitant about this change because it could have bigger effects.

If in doubt, we could add a new method and make it configurable whether the inner or outer HTML is returned.
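
To make the "configurable" idea concrete, here is a rough sketch only. The property name selenium.content.scope and the class ContentScopeSketch are made up for illustration and do not exist in Nutch. It also assumes that today's behaviour is to read the inner HTML of the body tag, so that stays the default.

```java
import java.util.Properties;

class ContentScopeSketch {
  // Hypothetical property name; not an existing Nutch setting.
  static final String SCOPE_KEY = "selenium.content.scope";

  /** Returns {tag name, DOM attribute} to read from the rendered page. */
  static String[] resolve(Properties conf) {
    // Assumed default: keep reading the body, so existing crawls are unaffected.
    String scope = conf.getProperty(SCOPE_KEY, "body").trim();
    if (scope.equalsIgnoreCase("html")) {
      // Whole document including <head>, e.g. to get at <title>.
      return new String[] {"html", "outerHTML"};
    }
    return new String[] {"body", "innerHTML"};
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(SCOPE_KEY, "html");
    String[] target = resolve(conf);
    System.out.println(target[0] + " / " + target[1]);
  }
}
```

The returned pair would then be passed to something like driver.findElement(By.tagName(target[0])).getAttribute(target[1]) in the Selenium code, so nothing changes unless the property is set.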

> 3. Add ability to extract meta tags with "property" attribute

+1

(This should also be applied to parse-tika: there's some duplicated code between parse-html and parse-tika.)
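
For reference, the lookup I have in mind is roughly the following. This is a standalone sketch and not the actual parse-html or parse-tika code: the class MetaTagSketch and its methods are made up, and the parse() helper only handles well-formed XHTML so that the example runs without an HTML parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MetaTagSketch {

  /** Collects meta content keyed by "name" or, if that is absent, by "property". */
  static Map<String, String> collect(Document doc) {
    Map<String, String> tags = new LinkedHashMap<>();
    NodeList metas = doc.getElementsByTagName("meta");
    for (int i = 0; i < metas.getLength(); i++) {
      Element meta = (Element) metas.item(i);
      String key = meta.getAttribute("name"); // "" when the attribute is missing
      if (key.isEmpty()) {
        key = meta.getAttribute("property"); // e.g. Open Graph tags such as og:title
      }
      if (!key.isEmpty() && meta.hasAttribute("content")) {
        tags.put(key.toLowerCase(Locale.ROOT), meta.getAttribute("content"));
      }
    }
    return tags;
  }

  /** Demo helper only: parses well-formed XHTML; real pages need an HTML parser. */
  static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("not well-formed XML", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><head>"
        + "<meta name=\"description\" content=\"A test page\"/>"
        + "<meta property=\"og:title\" content=\"Hello\"/>"
        + "</head><body/></html>";
    System.out.println(collect(parse(page)));
  }
}
```

Keeping this in one shared helper would also avoid having to patch the same logic twice in the two parser plugins.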

> 4. Allow selenium to handle gzip content

Definitely. However, I wonder whether this couldn't be delegated to one of the other HTTP protocol plugins. But I agree this might be tricky:
- calling a plugin from another one
- maybe better in the way discussed on the user list:
  send a HEAD request and decide which way to go based on the Content-Type
  response header
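
A rough sketch of the HEAD-request idea, using the plain JDK HTTP client (Java 11+). The names FetchRouteSketch, Route, routeFor and probe are made up, and the routing rule itself is an assumption on my side: HTML goes to the browser, everything else (feeds, compressed files, ...) goes to a plain HTTP fetch, and an unknown type falls back to the browser.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;

class FetchRouteSketch {
  enum Route { BROWSER, PLAIN_HTTP }

  /** Decides the route from a Content-Type header value (may be null). */
  static Route routeFor(String contentType) {
    if (contentType == null) {
      return Route.BROWSER; // assumed fallback when the server sends no type
    }
    // Drop parameters such as "; charset=UTF-8" before comparing.
    String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    if (mime.equals("text/html") || mime.equals("application/xhtml+xml")) {
      return Route.BROWSER;
    }
    return Route.PLAIN_HTTP;
  }

  /** Sends a HEAD request and routes on the Content-Type response header. */
  static Route probe(String url) throws IOException, InterruptedException {
    HttpRequest head = HttpRequest.newBuilder(URI.create(url))
        .method("HEAD", HttpRequest.BodyPublishers.noBody())
        .build();
    HttpResponse<Void> response = HttpClient.newHttpClient()
        .send(head, HttpResponse.BodyHandlers.discarding());
    return routeFor(response.headers().firstValue("Content-Type").orElse(null));
  }

  public static void main(String[] args) {
    // No network access here; only the decision step is exercised.
    System.out.println(routeFor("text/html; charset=UTF-8"));
    System.out.println(routeFor("application/rss+xml"));
  }
}
```

The obvious cost is one extra request per URL, and some servers answer HEAD differently from GET, so this would have to be optional.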

> 5. Treat RSS feeds as normal webpages by adding links to next segment fetch

This is actually done if you let parse-tika parse the feeds. The "feed" plugin is very special in this respect: it takes every feed item as one document and forwards it to the index. This is a different use case. I'm open to discussion, however.
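
To illustrate the "feed as a normal page" case: the feed is only a source of links, and each item link becomes a candidate URL for the next fetch. The sketch below is not how parse-tika does it internally; the class FeedLinksSketch is made up and assumes a plain RSS 2.0 feed with item and link elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FeedLinksSketch {

  /** Returns the link of every item in an RSS feed, in document order. */
  static List<String> itemLinks(String rssXml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // Feeds come from untrusted servers, so refuse DOCTYPE declarations.
      factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(rssXml)));

      List<String> links = new ArrayList<>();
      NodeList items = doc.getElementsByTagName("item");
      for (int i = 0; i < items.getLength(); i++) {
        NodeList linkNodes = ((Element) items.item(i)).getElementsByTagName("link");
        if (linkNodes.getLength() > 0) {
          String url = linkNodes.item(0).getTextContent().trim();
          if (!url.isEmpty()) {
            links.add(url); // would be handed on as an outlink, not indexed
          }
        }
      }
      return links;
    } catch (Exception e) {
      throw new IllegalArgumentException("feed is not well-formed XML", e);
    }
  }

  public static void main(String[] args) {
    String feed = "<rss version=\"2.0\"><channel><title>Demo</title>"
        + "<link>https://example.org/</link>"
        + "<item><title>A</title><link>https://example.org/a</link></item>"
        + "<item><title>B</title><link>https://example.org/b</link></item>"
        + "</channel></rss>";
    System.out.println(itemLinks(feed));
  }
}
```

The "feed" plugin, by contrast, would turn items A and B directly into two indexed documents without fetching the linked pages.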

Thanks for your contributions!

Best,
Sebastian

On 1/20/23 16:28, Kamil Mroczek wrote:
Hello,

I have a few improvements to Nutch and would like feedback on whether this community thinks I should submit them to the main branch. Once my first PR is approved, I can start adding these. Some of them might not be good ideas, so I'm happy to hear that feedback as well.

1. json-indexer: indexes documents in json lines format

2. selenium extracts the html tag vs the body tag (sample commit <https://github.com/Elio-Earth/nutch/commit/0e2aece37e3a8908221a50ee5803b8d3edd5a33e>): I needed this to extract the title of the page since that often lives in the head tag. I am hesitant about this change because it could have bigger effects.

3. Add ability to extract meta tags with "property" attribute (sample commit <https://github.com/Elio-Earth/nutch/commit/fa23f28f66a00f9048e72a0481c29ba01fc9af56>).

4. Allow selenium to handle gzip content (sample commit <https://github.com/Elio-Earth/nutch/commit/611387097a2f38e345e397bf32d515cea22729fc>): This is a port of the code from HTMLUnit that does the same thing. I needed this to process RSS feeds properly.

5. Treat RSS feeds as normal webpages by adding links to next segment fetch (sample commit <https://github.com/Elio-Earth/nutch/commit/1f8e4a9ee11de3c1a95926dbe279c0258d4c4438>)


Kamil
