[
https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357127#comment-14357127
]
Giuseppe Totaro commented on NUTCH-1957:
----------------------------------------
Hi [~zhique], I agree with your description. This "file-naming schema" can
cause collisions. If two or more files have the same basename but different
pathnames, only the first file is written, because all deserialized files
are placed in the same outputDir folder. Currently, the CommonCrawlDataDumper
tool works in the same way.
I am working on a fix in the CommonCrawlDataDumper tool (the same fix applies
to FileDumper). We can use either a unique "key" value as the filename (though
it could be very long) or the same structure/hierarchy as the input. In the
latter case, each output file has the same pathname as the original one.
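
To make the two options concrete, here is a minimal sketch of both. It is not
the code in FileDumper or CommonCrawlDataDumper. The class name DumpNaming, the
method names, the use of SHA-256 to shorten the key, and the "index.html"
placeholder for URLs ending in "/" are all assumptions made for illustration.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch.
final class DumpNaming {

    // Option 1: a fixed-length name derived from the full URL (the "key").
    // Hashing is an assumption here; it keeps the name short and filesystem-safe.
    static String hashedName(String url, String extension) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return extension.isEmpty() ? hex.toString() : hex + "." + extension;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Option 2: mirror the URL's host and path under outputDir.
    static Path mirroredPath(Path outputDir, String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String path = uri.getRawPath(); // keeps %20 etc. encoded
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path = path + "index.html"; // assumed placeholder for directory-like URLs
        }
        Path base = outputDir.normalize();
        Path target = base.resolve(host + path).normalize();
        // Reject paths such as /a/../../b that would escape outputDir.
        if (!target.startsWith(base)) {
            throw new IllegalArgumentException("Path escapes output dir: " + url);
        }
        return target;
    }
}
```

With the second option, the caller would need to create the parent directories
(for example with Files.createDirectories) before writing. This sketch ignores
query strings, so two URLs that differ only in their query would still map to
the same file under option 2; option 1 hashes whatever string it is given.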
Please give your feedback.
Thank you,
Giuseppe
> FileDumper output file name collisions
> --------------------------------------
>
> Key: NUTCH-1957
> URL: https://issues.apache.org/jira/browse/NUTCH-1957
> Project: Nutch
> Issue Type: Bug
> Components: tool
> Affects Versions: 1.10
> Reporter: Renxia Wang
> Priority: Minor
> Labels: dumper, filename, tools
>
> The FileDumper extracts the file base name and extension and uses
> <basename>.<extension> (e.g. given the url
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the
> <basename>.<extension> will be project.html) as the file name to dump the
> file.
> Code from FileDumper.java:
> String url = key.toString();
> String baseName = FilenameUtils.getBaseName(url);
> String extension = FilenameUtils.getExtension(url);
> ...
> String filename = baseName + "." + extension;
> This introduces file name collisions and leads to loss of data when using
> bin/nutch dump.
> Sample logs:
> 2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL:
> http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing:
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL:
> http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing:
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html