Hello,

I am updating a Nutch crawl that reads files in directories whose names
contain spaces. The URLs now show %20 instead of spaces, which does not seem
to be how it behaved in the past.

In Nutch 1.10, I get these results:

ParseData::
Version: 5
Status: success(1,0)
Title: Index of /nycor/10-15-2018 and on - Scanned
Outlinks: 4
  outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2018/ anchor: 2018/
  outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2019/ anchor: 2019/
  outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2022/ anchor: 2022/
  outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/Shipment Date Unknown/ anchor: Shipment Date Unknown/

In Nutch 1.19, I get this:

ParseData::
Version: 5
Status: success(1,0)
Title: Index of /nycor/10-15-2018 and on - Scanned
Outlinks: 4
  outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/ anchor: 2018/
  outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2019/ anchor: 2019/
  outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2022/ anchor: 2022/
  outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/ anchor: Shipment Date Unknown/

We are uploading to Solr, and the links aren't right with the %20s in the
URLs. How do I remove the %20s?
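
To make clear what I am after, here is a minimal Python sketch of the mapping
I would like applied to each URL. It is only an illustration, not Nutch code
or a Nutch setting: `decode_file_url` is a hypothetical helper name, and the
only library call used is `urllib.parse.unquote` from the Python standard
library.

```python
from urllib.parse import unquote


def decode_file_url(url: str) -> str:
    """Return the URL with percent-escapes such as %20 decoded.

    Hypothetical helper for illustration only; it is not part of Nutch.
    """
    return unquote(url)


# One of the outlinks reported by Nutch 1.19 above.
encoded = "file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/"
print(decode_file_url(encoded))
```

Note that `unquote` decodes every percent-escape, not only %20, so a name
that contains a literal percent sign would also be changed.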

Thanks,
Steve Cohen
