Hello Steve,

Having those spaces normalized/encoded is expected behaviour with
urlnormalizer-basic active. I would recommend keeping it this way so that
all URLs in Solr are properly encoded. Spaces in Solr IDs are also not
recommended, as they can lead to unexpected behaviour.

If you really don't want them encoded, disable urlnormalizer-basic in your
configuration.
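
As a rough sketch of what that could look like: plugins are enabled through
the plugin.includes property, which can be overridden in conf/nutch-site.xml.
The value below is only an illustrative assumption modelled on the stock
default, not your actual setting. Start from whatever your plugin.includes
currently contains and drop "basic" from the urlnormalizer group.

```xml
<!-- conf/nutch-site.xml: illustrative override, adapt to your own plugin list -->
<property>
  <name>plugin.includes</name>
  <!-- Assumed example value: "basic" removed from urlnormalizer-(pass|regex|basic) -->
  <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex)</value>
</property>
```

URLs that were already stored in encoded form would most likely stay that
way until the affected segments are crawled and parsed again.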

Regards,
Markus

On Tue, 9 Jan 2024 at 19:20, Steve Cohen <mail4st...@gmail.com> wrote:

> Hello,
>
> I am updating a Nutch crawl that reads files in directories whose names
> contain spaces. The URLs show %20 instead of spaces. This doesn't seem to
> be how it behaved in the past.
>
> In Nutch 1.10 I get these results:
>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /nycor/10-15-2018 and on - Scanned
> Outlinks: 4
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2018/ anchor: 2018/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2019/ anchor: 2019/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2022/ anchor: 2022/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/Shipment Date Unknown/ anchor: Shipment Date Unknown/
>
> In Nutch 1.19, I get this:
>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /nycor/10-15-2018 and on - Scanned
> Outlinks: 4
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/ anchor: 2018/
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2019/ anchor: 2019/
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2022/ anchor: 2022/
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/ anchor: Shipment Date Unknown/
>
> We are uploading to Solr, and the links aren't right with the %20s in the
> URL. How do I remove the %20s?
>
> Thanks,
> Steve Cohen
>
