Renxia Wang created NUTCH-1957:
----------------------------------

             Summary: FileDumper output file name collisions
                 Key: NUTCH-1957
                 URL: https://issues.apache.org/jira/browse/NUTCH-1957
             Project: Nutch
          Issue Type: Bug
          Components: tool
    Affects Versions: 1.10
            Reporter: Renxia Wang
            Priority: Minor


The FileDumper extracts the base name and extension from each URL and uses 
<basename>.<extension> as the name of the dumped file. For example, given the URL 
https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the file name 
will be project.html. 

Code from FileDumper.java: 

String url = key.toString();
String baseName = FilenameUtils.getBaseName(url);
String extension = FilenameUtils.getExtension(url);
...
String filename = baseName + "." + extension;

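Because only the last path segment of the URL is used, different URLs can map 
to the same file name. The file is written only for the first such URL and 
every later one is skipped, so data is lost when using bin/nutch dump. 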
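
One possible way to avoid the collision, sketched here for illustration only 
and not taken from FileDumper.java: prefix the name with a digest of the full 
URL, so that two URLs sharing a last path segment still get distinct file 
names. The class CollisionFreeNames and the methods lastSegment and 
hashedFileName are hypothetical names, and lastSegment is a simplified 
stand-in for the FilenameUtils calls quoted above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class CollisionFreeNames {

    // Simplified stand-in for baseName + "." + extension:
    // everything after the final '/' in the URL (may be empty).
    public static String lastSegment(String url) {
        return url.substring(url.lastIndexOf('/') + 1);
    }

    // Hypothetical naming scheme: <sha256-of-full-url>_<last segment>.
    // The digest covers the whole URL, so URLs that differ anywhere
    // produce different names. Characters in the last segment that are
    // invalid in file names (e.g. from query strings) are not handled here.
    public static String hashedFileName(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            String last = lastSegment(url);
            return last.isEmpty() ? hex.toString() : hex + "_" + last;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String a = "https://www.aoncadis.org/contact/Carin%20Ashjian/project.html";
        String b = "https://www.aoncadis.org/contact/Yarrow%20Axford/project.html";
        // Both URLs share the last segment, which is the collision reported here.
        System.out.println(lastSegment(a) + " / " + lastSegment(b));
        // With the digest prefix the two names differ.
        System.out.println(hashedFileName(a));
        System.out.println(hashedFileName(b));
    }
}
```

The trade-off of this sketch is that the dumped file names are long and no 
longer human-friendly; keeping the original last segment as a suffix is meant 
to preserve the extension and some readability.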

Sample logs:
2015-03-10 23:38:01,192 INFO  tools.FileDumper - Dumping URL: 
http://beringsea.eol.ucar.edu/data/
2015-03-10 23:38:01,193 INFO  tools.FileDumper - Skipping writing: 
[testFileName/.html]: file already exists
2015-03-10 23:38:16,717 INFO  tools.FileDumper - Dumping URL: 
http://catalog.eol.ucar.edu/
2015-03-10 23:38:16,719 INFO  tools.FileDumper - Skipping writing: 
[testFileName/.html]: file already exists

2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Christopher%20Arp/project.html
2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Mary%20Albert/project.html
2015-03-10 23:38:46,414 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,414 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Yarrow%20Axford/project.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
