[
https://issues.apache.org/jira/browse/NUTCH-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367490#comment-14367490
]
ASF GitHub Bot commented on NUTCH-1968:
---------------------------------------
GitHub user renxiawang opened a pull request:
https://github.com/apache/nutch/pull/14
NUTCH-1968 resolved file extension too long issue
Reported by Xin Zhang: https://issues.apache.org/jira/browse/NUTCH-1968
An overly long file extension causes bin/nutch dump to fail. This change
limits the file extension to 5 characters to fix the issue.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/renxiawang/nutch trunk
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/nutch/pull/14.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14
----
commit 23d7d8f62dec166b210cca0f49883580dfbef48d
Author: Renxia Wang
Date: 2015-03-12T10:01:38Z
NUTCH-1957 using MD5 as part of path and filename to solve filename
collision issue
commit 49db759fd61440689a6fe38997e969afc595658b
Author: rwang
Date: 2015-03-18T16:34:37Z
Merge remote-tracking branch 'upstream/trunk' into trunk
commit dbf84893e5bbb98b491ba33121c49a3f84810670
Author: rwang
Date: 2015-03-18T17:09:09Z
NUTCH-1968 resolved file extension too long issue
----
> File Name too long issue of DumpFileUtil.java file
> --------------------------------------------------
>
> Key: NUTCH-1968
> URL: https://issues.apache.org/jira/browse/NUTCH-1968
> Project: Nutch
> Issue Type: Bug
> Components: tool
> Affects Versions: 1.10
> Environment: Nutch 1.10 Revision 1667458
> Reporter: Xin Zhang
> Labels: dumper, filename
> Fix For: 1.10
>
> Attachments: EXTENSION_TOO_LONG.patch
>
>
> With the helpful patch that Renxia posted at
> https://issues.apache.org/jira/browse/NUTCH-1957, I figured out that we need
> to solve the file name collision; otherwise we will lose data. However, when
> I use this patch to run bin/nutch dump, I get a "File name too long" error as
> follows:
> zhangxin0804@zhangxin0804-VirtualBox:~/Desktop/Nutch/nutch/runtime/local$
> bin/nutch dump -outputDir outputDir -segment TestCrawl2/segments
> java.io.FileNotFoundException:/home/zhangxin0804/Desktop/Nutch/nutch/runtime/local/outputDir/86/fc/830433456bfbcff5f7b53661cc24d9d4_maps.php?submitted=true&year=2014&month=6&imgs%5b%5d=nationaltavgrank&imgs%5b%5d=nationaltmaxrank&imgs%5b%5d=nationaltminrank&imgs%5b%5d=nationalpcpnrank&imgs%5b%5d=regionaltavgrank&imgs%5b%5d=regionaltmaxrank&imgs%5b%5d=regionaltminrank&imgs%5b%5d=regionalpcpnrank&imgs%5b%5d=statewidetavgrank&imgs%5b%5d=statewidetmaxrank&imgs%5b%5d=statewidetminrank&imgs%5b%5d=statewidepcpnrank&imgs%5b%5d=divisionaltavgrank&imgs%5b%5d=divisionaltmaxrank&imgs%5b%5d=divisionaltminrank&imgs%5b%5d=divisionalpcpnrank&ts=3
> (File name too long)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
> at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:221)
> at org.apache.nutch.tools.FileDumper.main(FileDumper.java:309)
> I dug into this patch and found that it only checks the length of
> fileBaseName in /nutch/trunk/src/java/org/apache/nutch/util/DumpFileUtil.java.
> Therefore, if the <extension> is too long, the final outputFullPath is still
> too long, which means FileDumper.java will throw an exception. Probably not
> everyone will hit this issue and it may be a minor bug; correct me if I am
> wrong. Meanwhile, is it OK to truncate the fileExtension as we did for
> fileBaseName to solve this problem?
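
For illustration, the truncation being asked about could look roughly like the
sketch below. It is not the code from the patch or from DumpFileUtil.java: the
class name FileNameSketch, the helper truncate, and the base name cap of 32 are
assumptions made for this example. The extension cap of 5 comes from the pull
request description, and the <md5>_<baseName>.<extension> layout is inferred
from the path in the stack trace above.

```java
// Hypothetical sketch; NOT the actual DumpFileUtil implementation.
final class FileNameSketch {
    // Assumed cap for the base name (the issue only says its length is checked).
    static final int MAX_BASE_NAME_LENGTH = 32;
    // Cap taken from the pull request description ("length of file extension to be 5").
    static final int MAX_EXTENSION_LENGTH = 5;

    // Returns s unchanged if it fits, otherwise its first maxLength characters.
    static String truncate(String s, int maxLength) {
        return s.length() > maxLength ? s.substring(0, maxLength) : s;
    }

    // Builds "<md5>_<baseName>.<extension>", truncating both variable parts
    // so that a long query string in the extension cannot blow up the name.
    static String createFileName(String md5, String baseName, String extension) {
        return md5 + "_"
            + truncate(baseName, MAX_BASE_NAME_LENGTH) + "."
            + truncate(extension, MAX_EXTENSION_LENGTH);
    }

    public static void main(String[] args) {
        // Shortened version of the extension from the reported stack trace.
        String extension = "php?submitted=true&year=2014&month=6";
        System.out.println(
            createFileName("830433456bfbcff5f7b53661cc24d9d4", "maps", extension));
    }
}
```

With a 32-character MD5 prefix and these caps, the generated name is at most 71
characters, which should stay well under the 255-byte file name limit common on
Linux filesystems. The MD5 prefix keeps names distinct even when truncation
makes two base names or extensions identical.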
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)