Thanks for the response, Markus. Disabling urlnormalizer-basic works.

On Tue, Jan 9, 2024 at 3:43 PM Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hello Steve,
>
> Having those spaces normalized/encoded is expected behaviour with
> urlnormalizer-basic active. I would recommend keeping it this way and
> having all URLs in Solr properly encoded. Having spaces in Solr IDs is
> also not recommended, as it can lead to unexpected behaviour.
>
> If you really don't want them encoded, disable urlnormalizer-basic in your
> configuration.
>
> Regards,
> Markus
>
> On Tue, 9 Jan 2024 at 19:20, Steve Cohen <mail4st...@gmail.com> wrote:
>
> > Hello,
> >
> > I am updating a Nutch crawl that reads files in directories that have
> > spaces in their names. The URLs show %20 instead of spaces. This doesn't
> > seem to be what the behavior was in the past.
> >
> > In Nutch 1.10, I get these results:
> >
> > Nutch 1.10
> >
> > ParseData::
> > Version: 5
> > Status: success(1,0)
> > Title: Index of /nycor/10-15-2018 and on - Scanned
> > Outlinks: 4
> > outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2018/ anchor: 2018/
> > outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2019/ anchor: 2019/
> > outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2022/ anchor: 2022/
> > outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/Shipment Date Unknown/ anchor: Shipment Date Unknown/
> >
> > In Nutch 1.19, I get this:
> >
> > ParseData::
> > Version: 5
> > Status: success(1,0)
> > Title: Index of /nycor/10-15-2018 and on - Scanned
> > Outlinks: 4
> > outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/ anchor: 2018/
> > outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2019/ anchor: 2019/
> > outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2022/ anchor: 2022/
> > outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/ anchor: Shipment Date Unknown/
> >
> > We are uploading to Solr, and the links aren't right with the %20s in
> > the URL. How do I remove the %20s?
> >
> > Thanks,
> > Steve Cohen
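
For anyone reading this thread later: the two outlink forms shown above are the same path, once with spaces and once percent-encoded. The short Python sketch below illustrates that relationship only. It is not Nutch code and does not reproduce what urlnormalizer-basic does internally; the helper names `encode_path` and `decode_path` are made up for this example. It may be useful if the encoded form is kept in Solr, as recommended above, and the readable form is only needed for display.

```python
from urllib.parse import quote, unquote


def encode_path(url: str) -> str:
    """Percent-encode unsafe characters (e.g. spaces), keeping ':' and '/'."""
    # Hypothetical helper for illustration; not part of Nutch.
    return quote(url, safe=":/")


def decode_path(url: str) -> str:
    """Turn percent-escapes such as %20 back into literal characters."""
    # Hypothetical helper for illustration; not part of Nutch.
    return unquote(url)


# Outlink as reported in the Nutch 1.10 output quoted in the thread.
raw = "file:/nycor/10-15-2018 and on - Scanned/2018/"

encoded = encode_path(raw)
print(encoded)               # percent-encoded form, as seen in Nutch 1.19
print(decode_path(encoded))  # readable form with spaces restored
```

Decoding at display time leaves the stored IDs encoded, which fits the advice above about avoiding spaces in Solr IDs.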