nutch adds %20 in urls instead of spaces

2024-01-09 Thread Steve Cohen
Hello, I am updating a Nutch crawl that reads files in directories that have spaces. The URLs show %20 instead of spaces. This doesn't seem to be what the behavior was in the past. In Nutch 1.10 I get these results: Nutch 1.10 ParseData:: Version: 5 Status: success(1,0) Title: Index of

Re: nutch adds %20 in urls instead of spaces

2024-01-09 Thread Jim Anderson
unsubscribe On Tue, Jan 9, 2024 at 1:20 PM Steve Cohen wrote: > Hello, > > I am updating a nutch crawl that read files in directories that have > spaces. The urls show %20 instead of spaces. This doesn't seem to be what > the behavior was in the past. > > In nutch 1.10 I get these results > >

Re: nutch adds %20 in urls instead of spaces

2024-01-09 Thread Markus Jelsma
Hello Steve, Having those spaces normalized/encoded is expected behaviour with urlnormalizer-basic active. I would recommend keeping it this way and having all URLs in Solr properly encoded. Having spaces in Solr IDs is also not recommended, as it can lead to unexpected behaviour. If you really
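
For readers unfamiliar with the encoding discussed in this thread, here is a minimal Python sketch of what percent-encoding spaces in a URL path looks like. It is only an illustration using the standard library, not Nutch's urlnormalizer-basic code; the function name `encode_path_spaces` and the example URLs are hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_path_spaces(url: str) -> str:
    """Percent-encode unsafe characters (such as spaces) in the path of a URL.

    Hypothetical helper for illustration only; it does not reproduce
    Nutch's normalizer.
    """
    parts = urlsplit(url)
    # Keep "/" as the path separator and "%" so that escapes already
    # present (e.g. "%20") are left alone rather than double-encoded.
    path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(encode_path_spaces("http://example.com/my docs/report 1.txt"))
```

Because `%` is treated as safe, running the function on a URL that is already encoded returns it unchanged, so applying it more than once gives the same result.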