Thanks for the response, Markus. Disabling urlnormalizer-basic works.
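
For anyone finding this thread later, here is a rough sketch of why the two forms of the URL refer to the same path. It is plain Python using the standard library's urllib.parse, not Nutch's own normalizer code, and the choice of safe characters is my assumption, picked only to reproduce the output shown in the quoted message below.

```python
from urllib.parse import quote, unquote

# Path as it appeared in the Nutch 1.10 output (contains literal spaces).
raw = "file:/nycor/10-15-2018 and on - Scanned/Shipment Date Unknown/"

# Percent-encode the spaces; ':' and '/' are left untouched (assumed safe set).
encoded = quote(raw, safe=":/")
print(encoded)

# Decoding the %20 form gives back the original path.
decoded = unquote(encoded)
print(decoded == raw)
```

A consumer that needs the human-readable path can decode the stored URL at display time instead of storing the space-containing form as the ID.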

On Tue, Jan 9, 2024 at 3:43 PM Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hello Steve,
>
> Having those spaces normalized/encoded is expected behaviour with
> urlnormalizer-basic active. I would recommend keeping it this way and having
> all URLs in Solr properly encoded. Having spaces in Solr IDs is also not
> recommended as it can lead to unexpected behaviour.
>
> If you really don't want them encoded, disable urlnormalizer-basic in your
> configuration.
>
> Regards,
> Markus
>
> On Tue, 9 Jan 2024 at 19:20, Steve Cohen <mail4st...@gmail.com> wrote:
>
> > Hello,
> >
> > I am updating a Nutch crawl that reads files in directories whose names
> > contain spaces. The URLs now show %20 instead of spaces. This doesn't seem
> > to be how it behaved in the past.
> >
> > In Nutch 1.10 I get these results:
> >
> > ParseData::
> > Version: 5
> > Status: success(1,0)
> > Title: Index of /nycor/10-15-2018 and on - Scanned
> > Outlinks: 4
> >   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2018/ anchor:
> > 2018/
> >   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2019/ anchor:
> > 2019/
> >   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2022/ anchor:
> > 2022/
> >   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/Shipment Date
> > Unknown/ anchor: Shipment Date Unknown/
> >
> > In Nutch 1.19, I get this:
> >
> >
> > ParseData::
> > Version: 5
> > Status: success(1,0)
> > Title: Index of /nycor/10-15-2018 and on - Scanned
> > Outlinks: 4
> >   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/
> > anchor: 2018/
> >   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2019/
> > anchor: 2019/
> >   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2022/
> > anchor: 2022/
> >   outlink: toUrl:
> > file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/
> > anchor: Shipment Date Unknown/
> >
> > We are uploading to Solr, and the links aren't right with the %20s in the
> > URL. How do I remove the %20s?
> >
> > Thanks,
> > Steve Cohen
> >
>
