[
https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357691#comment-14357691
]
Sebastian Nagel commented on NUTCH-1957:
----------------------------------------
Just a few thoughts to finally solve this problem (see also NUTCH-1950):
* a URL is a unique name for a resource in the WWW
* md5(url) should also give a unique identifier
** OK, there may be collisions, but with a 128-bit MD5 sum we will definitely
hit a file system limit first, namely the max. number of files (in one
directory). A common practice to limit the number of files per directory is to
split the MD5 sum into blocks of 3-4 characters and use the first block(s) as
directory hierarchy, e.g.,
{{d7/a0/9ded039d2833ff602ac9d4cd5a8d_http_en_wikipedia_org_wiki_100}}.
** md5(content) has the disadvantage that the same URL, if re-crawled, may be
stored under a new file name
* everything else (extension, URL, file name) is only used to make the file
name human readable. We can freely skip some parts and/or special characters --
we do not risk any collisions.
* "As the FileDumper and the CommonCrawlDataDumper using the same way to store
file, we can make this a util." -- of course!
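A minimal sketch of such a util, assuming the layout from the example above
(two-character directory blocks, then the rest of the MD5 sum, then a readable
suffix derived from the URL). The class name DumpFileNames, its method names
and the 100-character limit on the suffix are hypothetical and not taken from
the Nutch code base; only the JDK is used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: maps a URL to a collision-safe relative dump path. */
final class DumpFileNames {

  /** Returns the MD5 sum of the URL as 32 lowercase hex characters. */
  static String md5Hex(String url) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder(32);
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // MD5 must be supported by every Java platform implementation
      throw new IllegalStateException(e);
    }
  }

  /** Human-readable part: runs of non-alphanumeric characters become "_". */
  static String readable(String url, int maxLen) {
    String s = url.replaceAll("[^A-Za-z0-9]+", "_");
    return s.length() > maxLen ? s.substring(0, maxLen) : s;
  }

  /** Builds "xx/yy/<remaining 28 hex chars>_<readable url>". */
  static String dumpPath(String url) {
    String md5 = md5Hex(url);
    return md5.substring(0, 2) + "/" + md5.substring(2, 4) + "/"
        + md5.substring(4) + "_" + readable(url, 100); // 100 is an assumed limit
  }

  public static void main(String[] args) {
    // Two URLs from the issue that both end in project.html
    System.out.println(dumpPath(
        "https://www.aoncadis.org/contact/Mary%20Albert/project.html"));
    System.out.println(dumpPath(
        "https://www.aoncadis.org/contact/Kerim%20Aydin/project.html"));
  }
}
```

Uniqueness comes only from the MD5 part, so the readable suffix can be
truncated or simplified without risking collisions.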
> FileDumper output file name collisions
> --------------------------------------
>
> Key: NUTCH-1957
> URL: https://issues.apache.org/jira/browse/NUTCH-1957
> Project: Nutch
> Issue Type: Bug
> Components: tool
> Affects Versions: 1.10
> Reporter: Renxia Wang
> Priority: Minor
> Labels: dumper, filename, tools
>
> The FileDumper extracts the file base name and extension and uses
> <basename>.<extension> (e.g. given the url
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the
> <basename>.<extension> will be project.html) as the file name to dump the
> file.
> Code from FileDumper.java:
> String url = key.toString();
> String baseName = FilenameUtils.getBaseName(url);
> String extension = FilenameUtils.getExtension(url);
> ...
> String filename = baseName + "." + extension;
> This introduces file name collisions and leads to loss of data when using
> bin/nutch dump.
> Sample logs:
> 2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL:
> http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing:
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL:
> http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing:
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing:
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL:
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html