Re: Checkstyle 80 char limit
Perhaps we should tell him these screens can also be turned landscape? ;) But seriously: 1) Does Doug still actively work on code? From my perspective: only very infrequently. 2) In the 200+ person IT department where I work I know only 1 colleague who uses his screen in portrait mode, and he doesn't do code. So, for whom do we really stick to the 80 chars limit? Niels On Tue, May 5, 2015 at 8:11 PM, Allen Wittenauer a...@altiscale.com wrote: On May 5, 2015, at 11:05 AM, Rich Haase rha...@pandora.com wrote: Can someone explain to me why on earth we care about limiting line length to 80 characters? Are there hadoop developers out there working from teletype terminals? Can we perhaps update this limit to something sane, like 120 chars? http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201407.mbox/%3CCALEq1Z8QvHof1A3zO0W5WGfbNjCOpfNo==jktq8jiu6efm_...@mail.gmail.com%3E -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Checkstyle 80 char limit
Quote from that page (i.e. the big block in RED at the top): The information on this page is for Archive Purposes Only This page is not being actively maintained. Links within the documentation may not work and the information itself may no longer be valid. The last revision to this document was made on April 20, 1999 So the 80 chars thing was at best reconsidered 16 years ago. Things have changed ... Niels Basjes On Tue, May 5, 2015 at 8:21 PM, Jonathan Eagles jeag...@gmail.com wrote: More formally, we follow the sun java coding standards which follow the 80 character limit. There is recent discussion over this, so it is a very relevant comment. http://wiki.apache.org/hadoop/HowToContribute http://www.oracle.com/technetwork/java/javase/documentation/codeconventions-136091.html#313 On Tue, May 5, 2015 at 1:11 PM, Allen Wittenauer a...@altiscale.com wrote: On May 5, 2015, at 11:05 AM, Rich Haase rha...@pandora.com wrote: Can someone explain to me why on earth we care about limiting line length to 80 characters? Are there hadoop developers out there working from teletype terminals? Can we perhaps update this limit to something sane, like 120 chars? http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201407.mbox/%3CCALEq1Z8QvHof1A3zO0W5WGfbNjCOpfNo==jktq8jiu6efm_...@mail.gmail.com%3E -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Checkstyle 80 char limit
I respect immensely the contributions Doug has made to many Apache projects, Hear hear! I fully agree with that one. On Tue, May 5, 2015 at 8:30 PM, Rich Haase rha...@pandora.com wrote: I respect immensely the contributions Doug has made to many Apache projects, but I don’t think that should be a reason to force everyone to write code as if our screen sizes can’t support more than 80 characters. On May 5, 2015, at 12:21 PM, Niels Basjes ni...@basjes.nl wrote: Perhaps we should tell him these screens can also be turned landscape ? ;) But seriously: 1) Does Doug still actively work on code? From my perspective; Only very infrequently. 2) In the 200+ people IT department where I work I know only 1 colleague who uses his screen in portrait mode and he doesn't do code. So, for who do we really stick to the 80 chars limit? Niels On Tue, May 5, 2015 at 8:11 PM, Allen Wittenauer a...@altiscale.com wrote: On May 5, 2015, at 11:05 AM, Rich Haase rha...@pandora.com wrote: Can someone explain to me why on earth we care about limiting line length to 80 characters? Are there hadoop developers out there working from teletype terminals? Can we perhaps update this limit to something sane, like 120 chars? http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201407.mbox/%3CCALEq1Z8QvHof1A3zO0W5WGfbNjCOpfNo==jktq8jiu6efm_...@mail.gmail.com%3E -- Best regards / Met vriendelijke groeten, Niels Basjes Rich Haase | Sr. Software Engineer | Pandora m (303) 887-1146 | rha...@pandora.com -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: IMPORTANT: testing patches for branches
Perhaps a script is due that creates the patch file with -exactly- the right name. Something like what HBase has as dev-support/make_patch.sh perhaps? On Wed, Apr 22, 2015 at 10:30 PM, Allen Wittenauer a...@altiscale.com wrote: Oh, this is also in the release notes, but one can use a git reference # as well. :) (with kudos to OOM for the idea.) On Apr 22, 2015, at 8:57 PM, Allen Wittenauer a...@altiscale.com wrote: More than likely. It probably needs more testing (esp under Jenkins). It should be noted that the code in test-patch.sh has lots of problems with branch-0, minor, and micro releases. But for major releases, it seems to work well for me. :) On Apr 22, 2015, at 8:45 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: Does this mean HADOOP-7435 is no longer needed / closeable as dup? Thanks +Vinod On Apr 22, 2015, at 12:34 PM, Allen Wittenauer a...@altiscale.com wrote: Hey gang, Just so everyone is aware, if you are working on a patch for either a feature branch or a major branch, if you name the patch with the branch name following the spec in HowToContribute (and a few other ways… test-patch tries to figure it out!), test-patch.sh *should* be switching the repo over to that branch for testing. For example, naming a patch foo-branch-2.01.patch should get tested on branch-2. Naming a patch foo-HDFS-7285.00.patch should get tested on the HDFS-7285 branch. This hopefully means that there should really be no more ‘blind’ +1’s to patches that go to branches. The “we only test against trunk” argument is no longer valid. :) -- Best regards / Met vriendelijke groeten, Niels Basjes
[jira] [Created] (HADOOP-11843) Make setting up the build environment easier
Niels Basjes created HADOOP-11843: - Summary: Make setting up the build environment easier Key: HADOOP-11843 URL: https://issues.apache.org/jira/browse/HADOOP-11843 Project: Hadoop Common Issue Type: New Feature Reporter: Niels Basjes Assignee: Niels Basjes ( As discussed with [~aw] ) In AVRO-1537 a docker based solution was created to setup all the tools for doing a full build. This enables much easier reproduction of any issues and getting up and running for new developers. This issue is to 'copy/port' that setup into the hadoop project in preparation for the bug squash. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: NFSv3 Filesystem Connector
Hi, The main reason Hadoop scales so well is that all components try to adhere to the idea of Data Locality. In general this means that you are running the processing/query software on the system where the actual data is already present on the local disk. To me this NFS solution sounds like hooking the processing nodes to a shared storage solution. This may work for small clusters (say 5 nodes or so) but for large clusters this shared storage will be the main bottleneck in the processing/query speed. We currently have more than 20 nodes with 12 hard disks each, resulting in over 50GB/sec [1] of disk-to-query-engine speed, and this means that our setup already goes much faster than any network connection to any NFS solution can provide. We can simply go to say 50 nodes and exceed 100GB/sec easily. So to me this sounds like hooking a scalable processing platform to a non-scalable storage system (mainly because the network to this storage doesn't scale). So far I have only seen vendors of legacy storage solutions going in this direction ... oh wait ... you are NetApp ... that explains it. I am no committer in any of the Hadoop tools but I vote against having such a core-concept-breaking piece in the main codebase. New people may start to think it is a good idea to do this. So I say you should simply make this plugin available to your customers, just not as a core part of Hadoop. Niels Basjes [1] 50 GB/sec = approx 20*12*200MB/sec This page shows max read speed in the 200MB/sec range: http://www.tomshardware.com/charts/enterprise-hdd-charts/-02-Read-Throughput-Maximum-h2benchw-3.16,3372.html On Tue, Jan 13, 2015 at 10:35 PM, Gokul Soundararajan gokulsoun...@gmail.com wrote: Hi, We (Jingxin Feng, Xing Lin, and I) have been working on providing a FileSystem implementation that allows Hadoop to utilize an NFSv3 storage server as a filesystem. It leverages code from the hadoop-nfs project for all the request/response handling. We would like your help to add it as part of hadoop tools (similar to the way hadoop-aws and hadoop-azure are included). In more detail, the Hadoop NFS Connector allows Apache Hadoop (2.2+) and Apache Spark (1.2+) to use an NFSv3 storage server as a storage endpoint. The NFS Connector can be run in two modes: (1) secondary filesystem - where Hadoop/Spark runs using HDFS as its primary storage and can use NFS as a second storage endpoint, and (2) primary filesystem - where Hadoop/Spark runs entirely on an NFSv3 storage server. The code is written in a way such that existing applications do not have to change. All one has to do is to copy the connector jar into the lib/ directory of Hadoop/Spark. Then, modify core-site.xml to provide the necessary details. The current version can be seen at: https://github.com/NetApp/NetApp-Hadoop-NFS-Connector It is my first time contributing to the Hadoop codebase. It would be great if someone on the Hadoop team can guide us through this process. I'm willing to make the necessary changes to integrate the code. What are the next steps? Should I create a JIRA entry? Thanks, Gokul -- Best regards / Met vriendelijke groeten, Niels Basjes
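For readers unfamiliar with how such a third-party FileSystem gets wired in: the "modify core-site.xml" step quoted above relies on Hadoop's standard fs.<scheme>.impl mechanism for mapping a URI scheme to a FileSystem class. The following is only a minimal sketch of that mechanism in Java; the "nfs" scheme and the implementation class name are assumptions for illustration, the real values come from the connector's own documentation.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NfsConnectorSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hadoop resolves a URI scheme to a FileSystem implementation via fs.<scheme>.impl.
    // The scheme and class name below are hypothetical placeholders, not verified values.
    conf.set("fs.nfs.impl", "org.apache.hadoop.fs.nfs.NFSv3FileSystem");
    FileSystem fs = FileSystem.get(new URI("nfs://filer.example.com/export"), conf);
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath());
    }
  }
}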
Re: Updates on migration to git
Hi, Great to see the move towards git. In terms of documentation could you please include the way binary files should be included in a patch (see this discussion https://www.mail-archive.com/common-dev%40hadoop.apache.org/msg13166.html ) and update http://wiki.apache.org/hadoop/GitAndHadoop too (this one still talks about the time when there were 3 projects). Thanks. -- Best regards / Met vriendelijke groeten, Niels Basjes
Deprecated configuration settings set from the core code / {core,hdfs,...}-default.xml ??
Hi, I found this because I was wondering why simply starting something as trivial as the pig grunt shell gives the following messages during startup:

2014-08-21 09:36:55,171 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-21 09:36:55,172 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

What I found is that these settings are not part of my config but they are part of the 'core hadoop' files. I found that mapred.job.tracker is set from code when using the mapred package (probably this is what pig uses) https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/JobClient.java#L869 and that fs.default.name, although explicitly defined as 'deprecated', is still present in one of the *-default.xml config files: https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/resources/core-default.xml#L524 I did some more digging and found several other properties that have been defined as deprecated but are still present in the various *-default.xml files throughout the hadoop code base. I used this list as a reference: https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/site/apt/DeprecatedProperties.apt.vm The ones I found so far:

./hadoop-common-project/hadoop-common/src/main/resources/core-default.xml: <name>fs.default.name</name>
./hadoop-common-project/hadoop-common/src/main/resources/core-default.xml: <name>io.bytes.per.checksum</name>
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml: <name>mapreduce.job.counters.limit</name>
./hadoop-tools/hadoop-distcp/src/main/resources/distcp-default.xml: <name>mapred.job.map.memory.mb</name>
./hadoop-tools/hadoop-distcp/src/main/resources/distcp-default.xml: <name>mapred.job.reduce.memory.mb</name>
./hadoop-tools/hadoop-distcp/src/main/resources/distcp-default.xml: <name>mapreduce.reduce.class</name>

Seems to me fixing these removes a lot of senseless clutter in the console messaging for end users. Or is there a good reason to keep it like this? -- Best regards / Met vriendelijke groeten, Niels Basjes
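To make the mechanism behind those messages concrete, here is a minimal sketch (assuming a Hadoop 2.x client on the classpath) of how the Configuration deprecation machinery produces them: a deprecated key only needs to be set or read, whether from code or from a bundled *-default.xml, to trigger the log line. The "my.old.key"/"my.new.key" names at the end are placeholders, not real Hadoop keys.

import org.apache.hadoop.conf.Configuration;

public class DeprecationDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Setting or reading a key that is registered as deprecated logs the
    // "... is deprecated. Instead, use ..." INFO message quoted above.
    conf.set("mapred.job.tracker", "localhost:8021");
    System.out.println(conf.get("fs.default.name"));
    // Client libraries can register additional deprecated keys themselves;
    // this is roughly how the MapReduce client handles its old key names.
    Configuration.addDeprecation("my.old.key", "my.new.key");
  }
}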
Hortonworks scripting ...
Hi, In core Hadoop you can, on your (desktop) client, have multiple clusters available simply by having multiple directories with the setting files (i.e. core-site.xml etc.) and select the one you want by changing the environment settings (i.e. HADOOP_CONF_DIR and such). This doesn't work when I run under the Hortonworks 2.1.2 distribution. There I find that all of the scripts placed in /usr/bin/ muck about with the environment settings. Things from /etc/default are sourced and they override my settings. Now I can control part of it by pointing BIGTOP_DEFAULTS_DIR at a blank directory. But in /usr/bin/pig the sourcing of /etc/default/hadoop is hardcoded into the script. Why is this done this way? P.S. Where is the git(?) repo located where this (apparently HW-specific) scripting is maintained? -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Change proposal for FileInputFormat isSplitable
Hi, I talked to some people and they agreed with me that the situation where this problem really occurs is when someone builds a FileInputFormat derivative that also uses a LineRecordReader derivative. This is exactly the scenario that occurs if someone follows the Yahoo Hadoop tutorial. Instead of changing the FileInputFormat (which many of the committers considered to be a bad idea) I created a very simple patch for the LineRecordReader that throws an exception (intentionally failing the entire job) when it receives a split for a compressed file that had not been compressed using a SplittableCompressionCodec and where the split does not start at the beginning of the file. So it fails if it detects a non-splittable file that has been split. So if you run this against a 1GB gzipped file then the first split of the whole file will complete successfully and all other splits will fail without even reading a single line. As far as I can tell this is a simple, clean and compatible patch that does not break anything. Also the change is limited to the most common place where this problem occurs. The only 'big' effect is that people who have been running a broken implementation will no longer be able to run this broken code iff they feed it 'large' non-splittable files. Which I think is a good thing. What do you (the committers) think of this approach? The patch I submitted a few days ago also includes the JavaDoc improvements (in FileInputFormat) provided by Gian Merlino https://issues.apache.org/jira/browse/MAPREDUCE-2094 Niels Basjes P.S. I still think that FileInputFormat.isSplitable() should implement a safe default instead of an optimistic default. On Sat, Jun 14, 2014 at 10:33 AM, Niels Basjes ni...@basjes.nl wrote: I did some digging through the code base and inspected all the situations I know where this goes wrong (including the yahoo tutorial) and found a place that may be a spot to avoid the effects of this problem. (Instead of solving the cause of the problem.) It turns out that all of those use cases use the LineRecordReader to read the data. This class (both the mapred and mapreduce versions) has the notion of the split that needs to be read, whether the file is compressed and whether this is a splittable compression codec. Now if we were to add code there that validates if the provided splits are valid or not (i.e. did the developer make this bug or not) then we could avoid the garbage data problem before it is fed into the actual mapper. This must then write error messages (+ a message "did you know you have been looking at corrupted data for a long time") that will appear in the logs of all the mapper attempts. At that point we can do one of these two actions in the LineRecordReader: - Fail hard with an exception. The job fails and the user immediately goes to the developer of the inputformat with a bug report. - Avoid the problem: Read the entire file iff the start of the split is 0, else read nothing. Many users will see a dramatic change in their results and (hopefully) start digging deeper. (Iff a human actually looks at the data.) I vote for the fail hard because then people are forced to fix the problem and correct the historical impact. Would this be a good / compatible solution? If so then I think we should have this in both the 2.x and 3.x. For the 3.x I also realized that perhaps isSplittable is something that could be delegated to the record reader. Would that make sense or is this something that does not belong there?
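For those who have not seen the patch, the check being described is roughly the following. This is only a sketch written against the current mapreduce API, not the actual patch; in the real patch the equivalent logic would live inside LineRecordReader.initialize().

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public final class SplitSanityCheck {
  private SplitSanityCheck() {
  }

  // Fail hard when an InputFormat hands out a mid-file split of a file whose
  // codec cannot be split: that is exactly the silent-garbage scenario.
  public static void verify(FileSplit split, Configuration conf) throws IOException {
    Path file = split.getPath();
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
    if (codec != null
        && !(codec instanceof SplittableCompressionCodec)
        && split.getStart() != 0) {
      throw new IOException("Received a split starting at " + split.getStart()
          + " for " + file + " which is compressed with a non-splittable codec ("
          + codec.getClass().getSimpleName()
          + "); the InputFormat should return false from isSplitable() for this file.");
    }
  }
}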
If not then I would still propose making the isSplittable abstract to fix the problem before it is created (in 3.x) Niels Basjes On Jun 13, 2014 11:47 PM, Chris Douglas cdoug...@apache.org wrote: On Fri, Jun 13, 2014 at 2:54 AM, Niels Basjes ni...@basjes.nl wrote: Hmmm, people only look at logs when they have a problem. So I don't think this would be enough. This change to the framework will cause disruptions to users, to aid InputFormat authors' debugging. The latter is a much smaller population and better equipped to handle this complexity. A log statement would print during submission, so it would be visible to users. If a user's job is producing garbage but submission was non-interactive, a log statement would be sufficient to debug the issue. If the naming conflict is common in some contexts, the warning can be disabled using the log configuration. Beyond that, input validation is the responsibility of the InputFormat author. Perhaps this makes sense: - For 3.0: Shout at the developer who does it wrong (i.e. make it abstract and force them to think about this) i.e. Create new abstract method isSplittable (tt) in FileInputFormat, remove isSplitable (one t). To avoid needless code duplication (which we already have in the codebase) create a helper method something like 'fileNameIndicatesSplittableFile' ( returns enum: Splittable/NonSplittable/Unknown ). - For 2.x: Keep the enduser safe
Re: Jenkins problem or patch problem?
I think this behavior is better. This way you know your patch was not (fully) applied. It would be even better if there was a way to submit a patch with a binary file in there. Niels On Mon, Jul 28, 2014 at 11:29 PM, Andrew Wang andrew.w...@cloudera.com wrote: I had the same issue on HDFS-6696, patch generated with git diff --binary. I ended up making the same patch without the binary part and it could be applied okay. This does differ in behavior from the old boxes, which were still able to apply the non-binary parts of a binary-diff. On Mon, Jul 28, 2014 at 3:06 AM, Niels Basjes ni...@basjes.nl wrote: For my test case I needed a something.txt.gz file. However for this specific test this file will never actually be read, it just has to be there and it must be a few bytes in size. Because binary files don't work I simply created a file containing "Hello world". Now this isn't a gzip file at all, yet for my test it does enough to make the test work as intended. So in fact I didn't solve the binary attachment problem at all. On Mon, Jul 28, 2014 at 1:40 AM, Ted Yu yuzhih...@gmail.com wrote: Mind telling us how you included the binary file in your svn patch ? Thanks On Sun, Jul 27, 2014 at 12:27 PM, Niels Basjes ni...@basjes.nl wrote: I created a patch file with SVN and it works now. I dare to ask: Are there any git created patch files that work? On Sun, Jul 27, 2014 at 9:44 PM, Niels Basjes ni...@basjes.nl wrote: I'll look for a workaround regarding the binary file. Thanks. On Sun, Jul 27, 2014 at 9:07 PM, Ted Yu yuzhih...@gmail.com wrote: Similar problem has been observed for HBase patches. Have you tried attaching level 1 patch ? For the binary file, to my knowledge, 'git apply' is able to handle it but hadoop is currently using svn. Cheers On Sun, Jul 27, 2014 at 11:01 AM, Niels Basjes ni...@basjes.nl wrote: Hi, I just submitted a patch and Jenkins said it failed to apply the patch. But when I look at the console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console it says: At revision 1613826. MAPREDUCE-2094 patch is being downloaded at Sun Jul 27 18:50:44 UTC 2014 fromhttp:// issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch*cp : cannot stat '/home/jenkins/buildSupport/lib/*': No such file or directory *The patch does not appear to apply with p0 to p2 PATCH APPLICATION FAILED Now I do have a binary file (for the unit test) in this patch, perhaps I did something wrong? Or is this problem caused by the error I highlighted? What can I do to fix this? -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Jenkins problem or patch problem?
For my test case I needed a something.txt.gz file. However for this specific test this file will never actually be read, it just has to be there and it must be a few bytes in size. Because binary files don't work I simply created a file containing "Hello world". Now this isn't a gzip file at all, yet for my test it does enough to make the test work as intended. So in fact I didn't solve the binary attachment problem at all. On Mon, Jul 28, 2014 at 1:40 AM, Ted Yu yuzhih...@gmail.com wrote: Mind telling us how you included the binary file in your svn patch ? Thanks On Sun, Jul 27, 2014 at 12:27 PM, Niels Basjes ni...@basjes.nl wrote: I created a patch file with SVN and it works now. I dare to ask: Are there any git created patch files that work? On Sun, Jul 27, 2014 at 9:44 PM, Niels Basjes ni...@basjes.nl wrote: I'll look for a workaround regarding the binary file. Thanks. On Sun, Jul 27, 2014 at 9:07 PM, Ted Yu yuzhih...@gmail.com wrote: Similar problem has been observed for HBase patches. Have you tried attaching level 1 patch ? For the binary file, to my knowledge, 'git apply' is able to handle it but hadoop is currently using svn. Cheers On Sun, Jul 27, 2014 at 11:01 AM, Niels Basjes ni...@basjes.nl wrote: Hi, I just submitted a patch and Jenkins said it failed to apply the patch. But when I look at the console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console it says: At revision 1613826. MAPREDUCE-2094 patch is being downloaded at Sun Jul 27 18:50:44 UTC 2014 fromhttp:// issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch*cp : cannot stat '/home/jenkins/buildSupport/lib/*': No such file or directory *The patch does not appear to apply with p0 to p2 PATCH APPLICATION FAILED Now I do have a binary file (for the unit test) in this patch, perhaps I did something wrong? Or is this problem caused by the error I highlighted? What can I do to fix this? -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes
Jenkins problem or patch problem?
Hi, I just submitted a patch and Jenkins said it failed to apply the patch. But when I look at the console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console it says: At revision 1613826. MAPREDUCE-2094 patch is being downloaded at Sun Jul 27 18:50:44 UTC 2014 fromhttp://issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch*cp: cannot stat '/home/jenkins/buildSupport/lib/*': No such file or directory *The patch does not appear to apply with p0 to p2 PATCH APPLICATION FAILED Now I do have a binary file (for the unit test) in this patch, perhaps I did something wrong? Or is this problem caused by the error I highlighted? What can I do to fix this? -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Jenkins problem or patch problem?
There are several other jobs (completely unrelated to my patch) that failed with exactly the same error. https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4770/console https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4769/console https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4768/console So I say Jenkins problem for now. On Sun, Jul 27, 2014 at 9:01 PM, Niels Basjes ni...@basjes.nl wrote: Hi, I just submitted a patch and Jenkins said it failed to apply the patch. But when I look at the console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console it says: At revision 1613826. MAPREDUCE-2094 patch is being downloaded at Sun Jul 27 18:50:44 UTC 2014 fromhttp://issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch*cp: cannot stat '/home/jenkins/buildSupport/lib/*': No such file or directory *The patch does not appear to apply with p0 to p2 PATCH APPLICATION FAILED Now I do have a binary file (for the unit test) in this patch, perhaps I did something wrong? Or is this problem caused by the error I highlighted? What can I do to fix this? -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Jenkins problem or patch problem?
I'll look for a workaround regarding the binary file. Thanks. On Sun, Jul 27, 2014 at 9:07 PM, Ted Yu yuzhih...@gmail.com wrote: Similar problem has been observed for HBase patches. Have you tried attaching level 1 patch ? For the binary file, to my knowledge, 'git apply' is able to handle it but hadoop is currently using svn. Cheers On Sun, Jul 27, 2014 at 11:01 AM, Niels Basjes ni...@basjes.nl wrote: Hi, I just submitted a patch and Jenkins said it failed to apply the patch. But when I look at the console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console it says: At revision 1613826. MAPREDUCE-2094 patch is being downloaded at Sun Jul 27 18:50:44 UTC 2014 fromhttp:// issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch*cp : cannot stat '/home/jenkins/buildSupport/lib/*': No such file or directory *The patch does not appear to apply with p0 to p2 PATCH APPLICATION FAILED Now I do have a binary file (for the unit test) in this patch, perhaps I did something wrong? Or is this problem caused by the error I highlighted? What can I do to fix this? -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Jenkins problem or patch problem?
I created a patch file with SVN and it works now. I dare to ask: Are there any git created patch files that work? On Sun, Jul 27, 2014 at 9:44 PM, Niels Basjes ni...@basjes.nl wrote: I'll look for a workaround regarding the binary file. Thanks. On Sun, Jul 27, 2014 at 9:07 PM, Ted Yu yuzhih...@gmail.com wrote: Similar problem has been observed for HBase patches. Have you tried attaching level 1 patch ? For the binary file, to my knowledge, 'git apply' is able to handle it but hadoop is currently using svn. Cheers On Sun, Jul 27, 2014 at 11:01 AM, Niels Basjes ni...@basjes.nl wrote: Hi, I just submitted a patch and Jenkins said it failed to apply the patch. But when I look at the console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console it says: At revision 1613826. MAPREDUCE-2094 patch is being downloaded at Sun Jul 27 18:50:44 UTC 2014 fromhttp:// issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch*cp : cannot stat '/home/jenkins/buildSupport/lib/*': No such file or directory *The patch does not appear to apply with p0 to p2 PATCH APPLICATION FAILED Now I do have a binary file (for the unit test) in this patch, perhaps I did something wrong? Or is this problem caused by the error I highlighted? What can I do to fix this? -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Change proposal for FileInputFormat isSplitable
Hi, On Wed, Jun 11, 2014 at 8:25 PM, Chris Douglas cdoug...@apache.org wrote: On Wed, Jun 11, 2014 at 1:35 AM, Niels Basjes ni...@basjes.nl wrote: That's not what I meant. What I understood from what was described is that sometimes people use an existing file extension (like .gz) for a file that is not a gzipped file. Understood, but this change also applies to other loaded codecs, like .lzo, .bz, etc. Adding a new codec changes the default behavior for all InputFormats that don't override this method. Yes it would. I think that forcing the developer of the file-based inputformat to implement this would be the best way to go. Making this method abstract is the first thing that springs to mind. This would break backwards compatibility so I think we can only do that with the 3.0.0 version. I consider silently producing garbage one of the worst kinds of problem to tackle. Because many custom file-based input formats have stumbled over the current isSplitable implementation (and silently produced garbage) I really want to avoid any more of this in the future. That is why I want to change the implementations in this area of Hadoop in such a way that this silently-producing-garbage effect is taken out. Adding validity assumptions to a common base class will affect a lot of users, most of whom are not InputFormat authors. True, but the thing is that if a user uses an InputFormat written by someone else and it silently produces garbage, they are affected in a much worse way. So the question remains: What is the way this should be changed? I'm willing to build it and submit a patch. Would a logged warning suffice? This would aid debugging without an incompatible change in behavior. It could also be disabled easily. -C Hmmm, people only look at logs when they have a problem. So I don't think this would be enough. Perhaps this makes sense: - For 3.0: Shout at the developer who does it wrong (i.e. make it abstract and force them to think about this) i.e. Create a new abstract method isSplittable (two t's) in FileInputFormat, remove isSplitable (one t). To avoid needless code duplication (which we already have in the codebase) create a helper method something like 'fileNameIndicatesSplittableFile' ( returns enum: Splittable/NonSplittable/Unknown ). - For 2.x: Keep the end user safe: Avoid silently producing garbage in all situations where the developer already did it wrong. (i.e. change isSplitable to return false) This costs performance only in those situations where the developer actually did it wrong (i.e. they didn't think this through) How about that? P.S. I created an issue for the NLineInputFormat problem I found: https://issues.apache.org/jira/browse/MAPREDUCE-5925 -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Change proposal for FileInputFormat isSplitable
On Tue, Jun 10, 2014 at 8:10 PM, Chris Douglas cdoug...@apache.org wrote: On Fri, Jun 6, 2014 at 4:03 PM, Niels Basjes ni...@basjes.nl wrote: and if you then give the file the .gz extension this breaks all common sense / conventions about file names. That the suffix for all compression codecs in every context- and all future codecs- should determine whether a file can be split is not an assumption we can make safely. Again, that's not an assumption that held when people built their current systems, and they would be justly annoyed with the project for changing it. That's not what I meant. What I understood from what was described is that sometimes people use an existing file extension (like .gz) for a file that is not a gzipped file. If a file is splittable or not depends greatly on the actual codec implementation that is used to read it. Using the default GzipCodec a .gz file is not splittable, but that can be changed with a different implementation like for example this https://github.com/nielsbasjes/splittablegzip So given a file extension the file 'must' be a file that is the format that is described by the file name extension. The flow is roughly as follows - What is the file extension - Get the codec class registered to that extension - Is this a splittable codec ? (Does this class implement the splittablecodec interface) I hold correct data much higher than performance and scalability; so the performance impact is a concern but it is much less important than the list of bugs we are facing right now. These are not bugs. NLineInputFormat doesn't support compressed input, and why would it? -C I'm not saying it should (in fact, for this one I agree that it shouldn't). The reality is that it accepts the file, decompresses it and then produces output that 'looks good' but really is garbage. I consider silently producing garbage one of the worst kinds of problem to tackle. Because many custom file based input formats have stumbled (getting silently produced garbage) over the current isSplitable implementation I really want to avoid any more of this in the future. That is why I want to change the implementations in this area of Hadoop in such a way that this silently producing garbage effect is taken out. So the question remains: What is the way this should be changed? I'm willing to build it and submit a patch. The safest way would be either 2 or 4. Solution 3 would effectively be the same as the current implementation, yet it would catch the problem situations as long as people stick to normal file name conventions. Solution 3 would also allow removing some code duplication in several subclasses. I would go for solution 3. Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes
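The flow described above ("solution 3") would look roughly like the sketch below as an isSplitable implementation. This is only an illustration of the idea: TextInputFormat already contains essentially this logic, and the proposal is about making something like it the FileInputFormat default.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch only: splittable unless the filename extension maps to a codec that
// does not implement SplittableCompressionCodec.
public class ExtensionAwareInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
        new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    if (codec == null) {
      return true; // no codec registered for this extension: plain, splittable input
    }
    return codec instanceof SplittableCompressionCodec;
  }
}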
Re: Change proposal for FileInputFormat isSplitable
On Mon, Jun 2, 2014 at 1:21 AM, Chris Douglas cdoug...@apache.org wrote: On Sat, May 31, 2014 at 10:53 PM, Niels Basjes ni...@basjes.nl wrote: The Hadoop framework uses the filename extension to automatically insert the right decompression codec in the read pipeline. This would be the new behavior, incompatible with existing code. You are right, I was wrong. It is the LineRecordReader that inserts it. Looking at this code and where it is used I noticed that the bug I'm trying to prevent is present in the current trunk. The NLineInputFormat does not override isSplitable and uses the LineRecordReader, which is capable of reading gzipped input. The overall effect is that this inputformat silently produces garbage (missing lines + duplicated lines) when run against a gzipped file. I just verified this. So if someone does what you describe then they would need to unload all compression codecs or face decompression errors. And if it really was gzipped then it would not be splittable at all. Assume an InputFormat configured for a job assumes that isSplitable returns true because it extends FileInputFormat. After the change, it could spuriously return false based on the suffix of the input files. In the prenominate example, SequenceFile is splittable, even if the codec used in each block is not. -C and if you then give the file the .gz extension this breaks all common sense / conventions about file names. Let's reiterate the options I see now: 1) isSplitable -- return true Too unsafe; I say it must change. I alone hit my head twice so far on this, many others have too, and even the current trunk still has this bug in there. 2) isSplitable -- return false Safe but too slow in some cases. In those cases the actual implementation can simply override it very easily and regain its original performance. 3) isSplitable -- true (same as the current implementation) unless you use a file extension that is associated with a non-splittable compression codec (i.e. .gz or something like that). If a custom format wants to break with well-known conventions about filenames then it should simply override isSplitable with its own implementation. 4) isSplitable -- abstract Compatibility breaker. I see this as the cleanest way to force the developer of the custom fileinputformat to think about their specific case. I hold correct data much higher than performance and scalability; so the performance impact is a concern but it is much less important than the list of bugs we are facing right now. The safest way would be either 2 or 4. Solution 3 would effectively be the same as the current implementation, yet it would catch the problem situations as long as people stick to normal file name conventions. Solution 3 would also allow removing some code duplication in several subclasses. I would go for solution 3. Niels Basjes
Re: Change proposal for FileInputFormat isSplitable
Ok, got it. If someone has an Avro file (foo.avro) and gzips that ( foo.avro.gz) then the frame work will select the GzipCodec which is not capable of splitting and which will cause the problem. So by gzipping a splittable file it becomes non splittable. At my workplace we have applied gzip to avro but then the compression applies to the blocks inside the avro file. So that are multiple gzipped blocks inside an avro container which is a splittable file without any changes. How would someone create the situation you are referring to? On May 31, 2014 1:06 AM, Doug Cutting cutt...@apache.org wrote: I was trying to explain my comment, where I stated that, changing the default implementation to return false would be an incompatible change. The patch was added 6 months after that comment, so the comment didn't address the patch. The patch does not appear to change the default implementation to return false unless the suffix of the file name is that of a known unsplittable compression format. So the folks who'd be harmed by this are those who used a suffix like .gz for an Avro, Parquet or other-format file. Their applications might suddenly run much slower and it would be difficult for them to determine why. Such folks are probably few, but perhaps exist. I'd prefer a change that avoided that possibility entirely. Doug On Fri, May 30, 2014 at 3:02 PM, Niels Basjes ni...@basjes.nl wrote: Hi, The way I see the effects of the original patch on existing subclasses: - implemented isSplitable -- no performance difference. - did not implement isSplitable -- then there is no performance difference if the container is either not compressed or uses a splittable compression. -- If it uses a common non splittable compression (like gzip) then the output will suddenly be different (which is the correct answer) and the jobs will finish sooner because the input is not processed multiple times. Where do you see a performance impact? Niels On May 30, 2014 8:06 PM, Doug Cutting cutt...@apache.org wrote: On Thu, May 29, 2014 at 2:47 AM, Niels Basjes ni...@basjes.nl wrote: For arguments I still do not fully understand this was rejected by Todd and Doug. Performance is a part of compatibility. Doug
Re: Change proposal for FileInputFormat isSplitable
The Hadoop framework uses the filename extension to automatically insert the right decompression codec in the read pipeline. So if someone does what you describe then they would need to unload all compression codecs or face decompression errors. And if it really was gzipped then it would not be splittable at all. Niels On May 31, 2014 11:12 PM, Chris Douglas cdoug...@apache.org wrote: On Fri, May 30, 2014 at 11:05 PM, Niels Basjes ni...@basjes.nl wrote: How would someone create the situation you are referring to? By adopting a naming convention where the filename suffix doesn't imply that the raw data are compressed with that codec. For example, if a user named SequenceFiles foo.lzo and foo.gz to record which codec was used, then isSplittable would spuriously return false. -C On May 31, 2014 1:06 AM, Doug Cutting cutt...@apache.org wrote: I was trying to explain my comment, where I stated that, changing the default implementation to return false would be an incompatible change. The patch was added 6 months after that comment, so the comment didn't address the patch. The patch does not appear to change the default implementation to return false unless the suffix of the file name is that of a known unsplittable compression format. So the folks who'd be harmed by this are those who used a suffix like .gz for an Avro, Parquet or other-format file. Their applications might suddenly run much slower and it would be difficult for them to determine why. Such folks are probably few, but perhaps exist. I'd prefer a change that avoided that possibility entirely. Doug On Fri, May 30, 2014 at 3:02 PM, Niels Basjes ni...@basjes.nl wrote: Hi, The way I see the effects of the original patch on existing subclasses: - implemented isSplitable -- no performance difference. - did not implement isSplitable -- then there is no performance difference if the container is either not compressed or uses a splittable compression. -- If it uses a common non splittable compression (like gzip) then the output will suddenly be different (which is the correct answer) and the jobs will finish sooner because the input is not processed multiple times. Where do you see a performance impact? Niels On May 30, 2014 8:06 PM, Doug Cutting cutt...@apache.org wrote: On Thu, May 29, 2014 at 2:47 AM, Niels Basjes ni...@basjes.nl wrote: For arguments I still do not fully understand this was rejected by Todd and Doug. Performance is a part of compatibility. Doug
Re: Change proposal for FileInputFormat isSplitable
My original proposal (from about 3 years ago) was to change the isSplitable method to return a safe default ( you can see that in the patch that is still attached to that Jira issue). For arguments I still do not fully understand this was rejected by Todd and Doug. So that is why my new proposal is to deprecate (remove!) the old method with the typo in Hadoop 3.0 and replace it with something correct and less error prone. Given the fact that this would happen in a major version jump I thought that would be the right time to do that. Niels On Thu, May 29, 2014 at 11:34 AM, Steve Loughran ste...@hortonworks.comwrote: On 28 May 2014 20:50, Niels Basjes ni...@basjes.nl wrote: Hi, Last week I ran into this problem again https://issues.apache.org/jira/browse/MAPREDUCE-2094 What happens here is that the default implementation of the isSplitable method in FileInputFormat is so unsafe that just about everyone who implements a new subclass is likely to get this wrong. The effect of getting this wrong is that all unit tests succeed and running it against 'large' input files (64MiB) that are compressed using a non-splittable compression (often Gzip) will cause the input to be fed into the mappers multiple time (i.e. you get garbage results without ever seeing any errors). Last few days I was at Berlin buzzwords talking to someone about this bug that was me, I recall. and this resulted in the following proposal which I would like your feedback on. 1) This is a change that will break backwards compatibility (deliberate choice). 2) The FileInputFormat will get 3 methods (the old isSplitable with the typo of one 't' in the name will disappear): (protected) isSplittableContainer -- true unless compressed with non-splittable compression. (protected) isSplittableContent -- abstract, MUST be implemented by the subclass (public) isSplittable -- isSplittableContainer isSplittableContent The idea is that only the isSplittable is used by other classes to know if this is a splittable file. The effect I hope to get is that a developer writing their own fileinputformat (which I alone have done twice so far) is 'forced' and 'helped' getting this right. I could see making the attributes more explicit would be good -but stopping everything that exists from working isn't going to fly. what about some subclass, AbstractSplittableFileInputFormat that implements the container properly, requires that content one -and then calculates IsSplitable() from the results? Existing code: no change, new formats can descend from this (and built in ones retrofitted). The reason for me to propose this as an incompatible change is that this way I hope to eradicate some of the existing bugs in custom implementations 'out there'. P.S. If you agree to this change then I'm willing to put my back into it and submit a patch. -- Best regards, Niels Basjes -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You. -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Change proposal for FileInputFormat isSplitable
This is exactly why I'm proposing a change that will either 'fix silently' (my original patch from 3 years ago) or 'break loudly' (my current proposal) old implementations. I'm convinced that there are at least 100 companies worldwide that have a custom implementation with this bug and have no clue that they have been basing decisions upon silently corrupted data. On Thu, May 29, 2014 at 1:21 PM, Jay Vyas jayunit...@gmail.com wrote: I think breaking backwards compat is sensible since It's easily caught by the compiler and in this case where the alternative is a Runtime error that can result in terabytes of mucked up output. On May 29, 2014, at 6:11 AM, Matt Fellows matt.fell...@bespokesoftware.com wrote: As someone who doesn't really contribute, just lurks, I could well be misinformed or under-informed, but I don't see why we can't deprecate a method which could cause dangerous side effects? People can still use the deprecated methods for backwards compatibility, but are discouraged by compiler warnings, and any changes they write to their code can start to use the new functionality? *Apologies if I'm stepping into a Hadoop holy war here On Thu, May 29, 2014 at 10:47 AM, Niels Basjes ni...@basjes.nl wrote: My original proposal (from about 3 years ago) was to change the isSplitable method to return a safe default ( you can see that in the patch that is still attached to that Jira issue). For arguments I still do not fully understand this was rejected by Todd and Doug. So that is why my new proposal is to deprecate (remove!) the old method with the typo in Hadoop 3.0 and replace it with something correct and less error prone. Given the fact that this would happen in a major version jump I thought that would be the right time to do that. Niels On Thu, May 29, 2014 at 11:34 AM, Steve Loughran ste...@hortonworks.com wrote: On 28 May 2014 20:50, Niels Basjes ni...@basjes.nl wrote: Hi, Last week I ran into this problem again https://issues.apache.org/jira/browse/MAPREDUCE-2094 What happens here is that the default implementation of the isSplitable method in FileInputFormat is so unsafe that just about everyone who implements a new subclass is likely to get this wrong. The effect of getting this wrong is that all unit tests succeed and running it against 'large' input files (64MiB) that are compressed using a non-splittable compression (often Gzip) will cause the input to be fed into the mappers multiple time (i.e. you get garbage results without ever seeing any errors). Last few days I was at Berlin buzzwords talking to someone about this bug that was me, I recall. and this resulted in the following proposal which I would like your feedback on. 1) This is a change that will break backwards compatibility (deliberate choice). 2) The FileInputFormat will get 3 methods (the old isSplitable with the typo of one 't' in the name will disappear): (protected) isSplittableContainer -- true unless compressed with non-splittable compression. (protected) isSplittableContent -- abstract, MUST be implemented by the subclass (public) isSplittable -- isSplittableContainer && isSplittableContent The idea is that only the isSplittable is used by other classes to know if this is a splittable file. The effect I hope to get is that a developer writing their own fileinputformat (which I alone have done twice so far) is 'forced' and 'helped' getting this right. I could see making the attributes more explicit would be good -but stopping everything that exists from working isn't going to fly. what about some subclass, AbstractSplittableFileInputFormat that implements the container properly, requires that content one -and then calculates IsSplitable() from the results? Existing code: no change, new formats can descend from this (and built in ones retrofitted). The reason for me to propose this as an incompatible change is that this way I hope to eradicate some of the existing bugs in custom implementations 'out there'. P.S. If you agree to this change then I'm willing to put my back into it and submit a patch. -- Best regards, Niels Basjes -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately
what about some subclass, AbstractSplittableFileInputFormat that implements the container properly, requires that content one -and then calculates IsSplitable() from the results? Existing code: no change, new formats can descend from this (and built in ones retrofitted). The reason for me to propose this as an incompatible change is that this way I hope to eradicate some of the existing bugs in custom implementations 'out there'. P.S. If you agree to this change then I'm willing to put my back into it and submit a patch. -- Best regards, Niels Basjes -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately
Re: Change proposal for FileInputFormat isSplitable
I forgot to ask a relevant question: What made the original proposed solution incompatible? To me it still seems to be a clean backward compatible solution that fixes this issue in a simple way. Perhaps Todd can explain why? Niels On May 29, 2014 2:17 PM, Niels Basjes ni...@basjes.nl wrote: This is exactly why I'm proposing a change that will either 'fix silently' (my original patch from 3 years ago) or 'break loudly' (my current proposal) old implementations. I'm convinced that ther are atleast 100 companies world wide that have a custom implementation with this bug and have no clue they have been basing descision upon silently corrupted data. On Thu, May 29, 2014 at 1:21 PM, Jay Vyas jayunit...@gmail.com wrote: I think breaking backwards compat is sensible since It's easily caught by the compiler and in this case where the alternative is a Runtime error that can result in terabytes of mucked up output. On May 29, 2014, at 6:11 AM, Matt Fellows matt.fell...@bespokesoftware.com wrote: As someone who doesn't really contribute, just lurks, I could well be misinformed or under-informed, but I don't see why we can't deprecate a method which could cause dangerous side effects? People can still use the deprecated methods for backwards compatibility, but are discouraged by compiler warnings, and any changes they write to their code can start to use the new functionality? *Apologies if I'm stepping into a Hadoop holy war here On Thu, May 29, 2014 at 10:47 AM, Niels Basjes ni...@basjes.nl wrote: My original proposal (from about 3 years ago) was to change the isSplitable method to return a safe default ( you can see that in the patch that is still attached to that Jira issue). For arguments I still do not fully understand this was rejected by Todd and Doug. So that is why my new proposal is to deprecate (remove!) the old method with the typo in Hadoop 3.0 and replace it with something correct and less error prone. Given the fact that this would happen in a major version jump I thought that would be the right time to do that. Niels On Thu, May 29, 2014 at 11:34 AM, Steve Loughran ste...@hortonworks.comwrote: On 28 May 2014 20:50, Niels Basjes ni...@basjes.nl wrote: Hi, Last week I ran into this problem again https://issues.apache.org/jira/browse/MAPREDUCE-2094 What happens here is that the default implementation of the isSplitable method in FileInputFormat is so unsafe that just about everyone who implements a new subclass is likely to get this wrong. The effect of getting this wrong is that all unit tests succeed and running it against 'large' input files (64MiB) that are compressed using a non-splittable compression (often Gzip) will cause the input to be fed into the mappers multiple time (i.e. you get garbage results without ever seeing any errors). Last few days I was at Berlin buzzwords talking to someone about this bug that was me, I recall. and this resulted in the following proposal which I would like your feedback on. 1) This is a change that will break backwards compatibility (deliberate choice). 2) The FileInputFormat will get 3 methods (the old isSplitable with the typo of one 't' in the name will disappear): (protected) isSplittableContainer -- true unless compressed with non-splittable compression. (protected) isSplittableContent -- abstract, MUST be implemented by the subclass (public) isSplittable -- isSplittableContainer isSplittableContent The idea is that only the isSplittable is used by other classes to know if this is a splittable file. 
The effect I hope to get is that a developer writing their own fileinputformat (which I alone have done twice so far) is 'forced' and 'helped' getting this right. I could see making the attributes more explicit would be good -but stopping everything that exists from working isn't going to fly. what about some subclass, AbstractSplittableFileInputFormat that implements the container properly, requires that content one -and then calculates IsSplitable() from the results? Existing code: no change, new formats can descend from this (and built in ones retrofitted). The reason for me to propose this as an incompatible change is that this way I hope to eradicate some of the existing bugs in custom implementations 'out there'. P.S. If you agree to this change then I'm willing to put my back into it and submit a patch. -- Best regards, Niels Basjes -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law
Change proposal for FileInputFormat isSplitable
Hi, Last week I ran into this problem again https://issues.apache.org/jira/browse/MAPREDUCE-2094 What happens here is that the default implementation of the isSplitable method in FileInputFormat is so unsafe that just about everyone who implements a new subclass is likely to get this wrong. The effect of getting this wrong is that all unit tests succeed and running it against 'large' input files (64MiB) that are compressed using a non-splittable compression (often Gzip) will cause the input to be fed into the mappers multiple times (i.e. you get garbage results without ever seeing any errors). Last few days I was at Berlin buzzwords talking to someone about this bug and this resulted in the following proposal which I would like your feedback on. 1) This is a change that will break backwards compatibility (deliberate choice). 2) The FileInputFormat will get 3 methods (the old isSplitable with the typo of one 't' in the name will disappear): (protected) isSplittableContainer -- true unless compressed with non-splittable compression. (protected) isSplittableContent -- abstract, MUST be implemented by the subclass (public) isSplittable -- isSplittableContainer && isSplittableContent The idea is that only the isSplittable is used by other classes to know if this is a splittable file. The effect I hope to get is that a developer writing their own fileinputformat (which I alone have done twice so far) is 'forced' and 'helped' getting this right. The reason for me to propose this as an incompatible change is that this way I hope to eradicate some of the existing bugs in custom implementations 'out there'. P.S. If you agree to this change then I'm willing to put my back into it and submit a patch. -- Best regards, Niels Basjes
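To make the proposed shape concrete, here is a hypothetical sketch of the three methods. Nothing like this class exists in the Hadoop codebase; it only illustrates the proposal, with the container check implemented via the compression codec registered for the filename extension.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public abstract class SplittabilityAwareFileInputFormat<K, V> extends FileInputFormat<K, V> {

  // True unless the filename maps to a codec that cannot be split.
  protected boolean isSplittableContainer(JobContext context, Path file) {
    CompressionCodec codec =
        new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null || codec instanceof SplittableCompressionCodec;
  }

  // Subclasses MUST decide whether their own record format can be split.
  protected abstract boolean isSplittableContent(JobContext context, Path file);

  // The only method other code should consult.
  public final boolean isSplittable(JobContext context, Path file) {
    return isSplittableContainer(context, file) && isSplittableContent(context, file);
  }
}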
Re: Sorting user defined MR counters.
Hi Steve, Now for submitting changes for Hadoop: Is it desirable that I fix these in my change set or should I leave these as-is to avoid obfuscating the changes that are relevant to the Jira at hand? I recommend a cleanup first -that's likely to go in without any argument. Your patch with the new features would be a diff against the clean, so have less changes to be reviewed. Ok, I'll have a look what I can do. Should I focus on fixing problems within the entire code base or limit my changes to a limited set of subprojects (i.e. only the mapreduce ones) ? -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Making Gzip splittable
Hi, On Wed, Feb 22, 2012 at 19:14, Tim Broberg tim.brob...@exar.com wrote: There are three options here: 1 - Add your codec as an alternative to the default gzip codec. 2 - Modify the gzip codec to incorporate your feature so that it is pseudo-splittable by default (skippable?) 3 - Do nothing. The code uses the normal splittability interface and doesn't invent some new solution. It seems perfectly well explained. The choice was made to implement it as a separate 'Codec' that reuses all decompression functionality from the existing GzipCodec without making any changes to the original. This way there is no duplicate code and there is no risk that the existing GzipCodec is affected by the new functionality. This was actually one of the first review comments I got on one of the first versions (which did have a few minor changes in the GzipCodec). So that is why option '1' was chosen instead of '2'. There is a lot of explanation in there on how to switch over from one codec to the other. Enabling the codec is only one setting. There is, however, quite a bit of documentation on how to use it. Does it all get simpler if skippability is implemented by default but the option is not enabled? There are two answers to this: 1) No, it won't get simpler. 2) This feature cannot be disabled per codec. The reason is that the framework creates splits if the applicable codec implements the SplittableCompressionCodec interface. This check is done purely by doing an instanceof check. After that the FileInputFormat creates the splits without consulting the codec class at all. So either a codec is splittable or it is not, and the splits are defined independently of the codec. So there is (unfortunately) currently no way to create a codec that can be made splittable/non-splittable by using a config setting. Does this make things any less potentially confusing? I don't think this would make it less confusing. Niels - Tim. From: ni...@basj.es [ni...@basj.es] On Behalf Of Niels Basjes [ni...@basjes.nl] Sent: Sunday, February 19, 2012 4:23 PM To: common-dev Subject: Making Gzip splittable Hi, As some of you know I've created a patch that effectively makes Gzip splittable. https://issues.apache.org/jira/browse/HADOOP-7076 What this does is: for a split somewhere in the middle of the file, it will read from the start of the file up until the point where the split starts. This is a useful waste of resources because it creates room to run a heavy-lifting mapper in parallel. Due to this balance between the waste being useful and the waste being wasteful I've included extensive documentation in the patch on how it works and how to use it. I've seen that there are quite a few real-life situations where I expect my solution can be useful. What I created is, as far as I can tell, the only way you can split a gzipped file without prior knowledge about the actual file. If you do have prior information then other directions with a similar goal are possible: - Analyzing the file beforehand: HADOOP-6153 (https://issues.apache.org/jira/browse/HADOOP-6153) - Creating a specially crafted gzipped file: HADOOP-7909 (https://issues.apache.org/jira/browse/HADOOP-7909) Over the last year I've had review comments from Chris Douglas (until he stopped being involved in Hadoop) and later from Luke Lu. Now the last feedback I got from Luke is this: Niels, I'm ambivalent about this patch. It has clean code and documentation, OTOH, it has really confusing usage/semantics and dubious general utility that the community might not want to maintain as part of an official release.
After having to explain many finer points of Hadoop to new users/developers these days, I think the downside of this patch might outweigh its benefits. I'm -0 on it, i.e. you need somebody else to +1 on this. So after consulting Eli I'm asking this group. My views on this feature: - I think this feature should go in because I think others can benefit from it. - I also think that it should remain disabled by default. It can then be used by those that read the documentation. - The implementation does not contain any decompression code at all. It only does the splitting smartness. (It could even be refactored to make any codec splittable). It has been tested with both the Java and the native decompressors. What do you think? Is this a feature that should go in the official release or not? -- Best regards, Niels Basjes
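As explained above, the framework's decision is just an instanceof check against SplittableCompressionCodec; the codec itself is not consulted any further. A rough sketch of that decision, modelled on what TextInputFormat does (not a verbatim copy of the Hadoop source):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;

final class SplitDecision {

  /**
   * An uncompressed file is always splittable; a compressed file only if its
   * codec implements SplittableCompressionCodec. Because this is the whole
   * decision, a codec cannot be made splittable or non-splittable through a
   * configuration switch.
   */
  static boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
        new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null || codec instanceof SplittableCompressionCodec;
  }
}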
Re: Gzip progress during map phase.
Yes, this is what I was looking for. Thanks -- Met vriendelijke groet, Niels Basjes (Sent from mobile) On 27 Dec 2011 12:08, Koji Noguchi knogu...@yahoo-inc.com wrote: Assuming you're using TextInputFormat, it sounds like https://issues.apache.org/jira/browse/MAPREDUCE-773 In 0.21. Don't know about CDH. Koji On 12/27/11 2:00 AM, Niels Basjes ni...@basjes.nl wrote: I would not expect this. I would expect behaviour that is independent of the way the splits are created. -- Met vriendelijke groet, Niels Basjes (Sent from mobile) On 26 Dec 2011 07:57, Anthony Urso antho...@cs.ucla.edu wrote: Gzip files (unlike uncompressed files) are not splittable, which may be causing the behavior that you described. On Dec 24, 2011 6:24 AM, Niels Basjes ni...@basjes.nl wrote: Hi, I noticed that the mapper progress indication in the Hadoop CDH3 distribution jumps from 0% to 100% for each gzipped input file. So when running with big gzipped input files the job appears to be stuck. I was unable to find a JIRA issue that describes this effect. Before I dive into this I have a few questions for you guys: 1) Is this a known effect for the 0.20 version? If so, what is the JIRA issue? 2) Is this specific to gzip? 3) Is this effect still present in the MRv2/YARN version of Hadoop? Thanks. -- Met vriendelijke groet, Niels Basjes (Sent from mobile)
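What MAPREDUCE-773 does, roughly, is report progress from the position in the compressed file instead of from the decompressed records, so a single-split gzipped input no longer sits at 0% and then jumps to 100%. A hedged sketch of the idea (field names are illustrative; this is not the actual patch):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

final class CompressedSplitProgress {
  private final long start;               // first byte of the split in the compressed file
  private final long end;                 // last byte of the split in the compressed file
  private final FSDataInputStream fileIn; // the raw (still compressed) input stream

  CompressedSplitProgress(long start, long end, FSDataInputStream fileIn) {
    this.start = start;
    this.end = end;
    this.fileIn = fileIn;
  }

  /** Progress based on how far the reader has advanced through the compressed bytes. */
  float getProgress() throws IOException {
    if (end == start) {
      return 0.0f;
    }
    return Math.min(1.0f, (fileIn.getPos() - start) / (float) (end - start));
  }
}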
Re: Gzip progress during map phase.
I would not expect this. I would expect behaviour that is independent of the way the splits are created. -- Met vriendelijke groet, Niels Basjes (Sent from mobile) On 26 Dec 2011 07:57, Anthony Urso antho...@cs.ucla.edu wrote: Gzip files (unlike uncompressed files) are not splittable, which may be causing the behavior that you described. On Dec 24, 2011 6:24 AM, Niels Basjes ni...@basjes.nl wrote: Hi, I noticed that the mapper progress indication in the Hadoop CDH3 distribution jumps from 0% to 100% for each gzipped input file. So when running with big gzipped input files the job appears to be stuck. I was unable to find a JIRA issue that describes this effect. Before I dive into this I have a few questions for you guys: 1) Is this a known effect for the 0.20 version? If so, what is the JIRA issue? 2) Is this specific to gzip? 3) Is this effect still present in the MRv2/YARN version of Hadoop? Thanks. -- Met vriendelijke groet, Niels Basjes (Sent from mobile)
Re: Which branch for my patch?
I got this: Hadoop QA commented on HADOOP-7076: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12506182/HADOOP-7076-branch-0.22.patch against trunk revision . ... -1 patch. The patch command could not apply the patch. Did I do something wrong in the patch I created for branch-0.22? Or is HADOOP-7435 not yet operational? Thanks. Niels Basjes On Tue, Dec 6, 2011 at 00:17, Niels Basjes ni...@basjes.nl wrote: Hi, On Mon, Dec 5, 2011 at 18:54, Eli Collins e...@cloudera.com wrote: https://issues.apache.org/jira/browse/HADOOP-7076 Your patch based on the old structure would be useful for backporting this feature from trunk to a release with the old structure (e.g. 1.x, 0.22). To request inclusion in a 1.x release set the target version to 1.1.0 (and generate a patch against branch-1). To request inclusion in 0.22 set the target version to 0.22.0 (and generate a patch against branch-0.22). Turns out my changes are not trivial to backport to branch-1 because the SplittableCompressionCodec interface isn't there yet. So I'm limiting this to trunk (0.23.1) and branch-0.22 (0.22.0). ... Unfortunately Jenkins doesn't currently run tests against non-trunk trees. For these branches you need to run test-patch (covered in the above page) and the tests yourself. Unfortunately the test-patch script mentioned on the site (dev-support/test-patch.sh /path/to/my.patch) doesn't exist in branch-0.22. And I couldn't get the test-patch.sh that does exist to work. So, against a clean checkout, I did: patch -p0 < HADOOP-7076-branch-0.22.patch followed by: ant clean test jar which succeeded. -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Which branch for my patch?
Hi, https://issues.apache.org/jira/browse/HADOOP-7076 Jenkins has just completed. Although it passed everything else it was '-1' because of 9 javadoc warnings that do not seem related to my patch. https://builds.apache.org/job/PreCommit-HADOOP-Build/432/artifact/trunk/hadoop-common-project/patchprocess/patchJavadocWarnings.txt Yea, these are not due to your patch. I'll bump the javadoc warnings in test-patch.properties. Thanks. Your patch based on the old structure would be useful for backporting this feature from trunk to a release with the old structure (e.g. 1.x, 0.22). To request inclusion in a 1.x release set the target version to 1.1.0 (and generate a patch against branch-1). To request inclusion in 0.22 set the target version to 0.22.0 (and generate a patch against branch-0.22). Do I simply make separate Jira (related) issues for these backports? Nope, just set the target version field to the appropriate version and upload a patch, e.g. hadoop-7076-branch-1.patch. So that I understand correctly: - I add the targets to the Jira issue for each branch-specific patch. - I create a new patch file for each version I want the feature to appear in and attach these to the issue. - I name these patches something like issueid-date-branchid.patch so that the committer can clearly see what each was intended for. Do I have to do something to ensure Jenkins will accept this all correctly? Perhaps in the naming convention? Or in the timing between uploading the various patches? -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Which branch for my patch?
Hi, On Wed, Nov 30, 2011 at 18:51, Eli Collins e...@cloudera.com wrote: Thanks for contributing. The best place to contribute new features is to trunk. It's currently an easy merge from trunk to branch 23 to get it in a 23.x release (you can set the jira's target version to 23.1 to indicate this). I've just uploaded the new patch created against the trunk and set the target for 0.23.1 as you indicated. https://issues.apache.org/jira/browse/HADOOP-7076 Jenkins has just completed. Although it passed everything else it was '-1' because of 9 javadoc warnings that do not seem related to my patch. https://builds.apache.org/job/PreCommit-HADOOP-Build/432/artifact/trunk/hadoop-common-project/patchprocess/patchJavadocWarnings.txt Your patch based on the old structure would be useful for backporting this feature from trunk to a release with the old structure (e.g. 1.x, 0.22). To request inclusion in a 1.x release set the target version to 1.1.0 (and generate a patch against branch-1). To request inclusion in 0.22 set the target version to 0.22.0 (and generate a patch against branch-0.22). Do I simply make separate Jira (related) issues for these backports? -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Which branch for my patch?
Thanks, I'll get busy creating a new patch over the next few days. Niels Basjes On Wed, Nov 30, 2011 at 18:51, Eli Collins e...@cloudera.com wrote: Hey Niels, Thanks for contributing. The best place to contribute new features is to trunk. It's currently an easy merge from trunk to branch 23 to get it in a 23.x release (you can set the jira's target version to 23.1 to indicate this). Your patch based on the old structure would be useful for backporting this feature from trunk to a release with the old structure (e.g. 1.x, 0.22). To request inclusion in a 1.x release set the target version to 1.1.0 (and generate a patch against branch-1). To request inclusion in 0.22 set the target version to 0.22.0 (and generate a patch against branch-0.22). Thanks, Eli On Wed, Nov 30, 2011 at 8:23 AM, Niels Basjes ni...@basjes.nl wrote: Hi all, A while ago I created a feature for Hadoop and submitted it to be included (HADOOP-7076). Around the same time MRv2 started happening and the entire source tree was restructured. At this moment I'm prepared to change the patch I created earlier so I can submit it again for your consideration. Because of the email about the new branches (branch-1 and branch-1.0) I'm a bit puzzled at this moment where to start. I see the mentioned branches and the trunk as probable starting points. As far as I understand the repository structure, branch-1 is the basis for the old-style Hadoop and the trunk is the basis for the YARN Hadoop. For which branch of the source tree should I make my changes so you guys will reevaluate it for inclusion? Thanks. -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes
Which branch for my patch?
Hi all, A while ago I created a feature for Hadoop and submitted it to be included (HADOOP-7076). Around the same time MRv2 started happening and the entire source tree was restructured. At this moment I'm prepared to change the patch I created earlier so I can submit it again for your consideration. Because of the email about the new branches (branch-1 and branch-1.0) I'm a bit puzzled at this moment where to start. I see the mentioned branches and the trunk as probable starting points. As far as I understand the repository structure, branch-1 is the basis for the old-style Hadoop and the trunk is the basis for the YARN Hadoop. For which branch of the source tree should I make my changes so you guys will reevaluate it for inclusion? Thanks. -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: making file system block size bigger to improve hdfs performance ?
Have you tried it to see what difference it makes? -- Met vriendelijke groet, Niels Basjes (Sent from mobile) On 3 Oct 2011 07:06, Jinsong Hu jinsong...@hotmail.com wrote: Hi there: I just thought of an idea. When we format the disk, the block size is usually 1K to 4K. For HDFS, the block size is usually 64M. I wonder, if we change the raw file system's block size to something significantly bigger, say 1M or 8M, will that improve disk IO performance for Hadoop's HDFS? Currently, I noticed that the MapR distribution uses MFS, its own file system. That resulted in a 4 times performance gain in terms of disk IO. I just wonder whether, if we tune the hosting OS parameters, we can achieve better disk IO performance with just the regular Apache Hadoop distribution. I understand that making the block size bigger can result in some disk space waste for small files. However, for disks dedicated to HDFS, where most of the files are very big, I just wonder if it is a good idea. Anybody have any comments? Jimmy
What happened to Chris?
Hi all, Over the last 7 months I've been periodically emailing with Chris Douglas (via his @apache.org account) about a new feature for Hadoop I've created (HADOOP-7076). However, I've not had any response at all to my last two emails (dating back about 6 weeks and about 1 week). So I'm wondering why this is happening. Is he on a long vacation? Or is there something else? -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Hadoop Annotations support
I assume you mean something like this: https://github.com/SpringSource/spring-hadoop On Sat, Jul 2, 2011 at 04:22, Raja Nagendra Kumar nagendra.r...@tejasoft.com wrote: Hi, Are there any plans for Hadoop to support annotations, especially for eliminating API-level configuration calls? E.g. conf.setMapperClass(MaxTemperatureMapper.class); conf.setCombinerClass(MaxTemperatureReducer.class); conf.setReducerClass(MaxTemperatureReducer.class); could easily be eliminated through proper class-level annotations. Regards, Raja Nagendra Kumar, C.T.O www.tejasoft.com -- View this message in context: http://old.nabble.com/Hadoop-Annotations-support-tp31977831p31977831.html Sent from the Hadoop core-dev mailing list archive at Nabble.com. -- Best regards / Met vriendelijke groeten, Niels Basjes
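Purely as an illustration of what such class-level annotations could look like (nothing like this exists in Hadoop itself, and Spring Hadoop takes a different, bean-configuration-based approach), a hypothetical annotation plus the reflection needed to apply it to a Job:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Hypothetical annotation declaring the mapper/combiner/reducer on the driver class. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface MapReduceClasses {
  Class<? extends Mapper> mapper();
  Class<? extends Reducer> combiner() default Reducer.class;
  Class<? extends Reducer> reducer();
}

final class AnnotationConfigurator {
  /** Reads @MapReduceClasses from the driver class and applies it to the given job. */
  static void configure(Job job, Class<?> driverClass) {
    MapReduceClasses a = driverClass.getAnnotation(MapReduceClasses.class);
    if (a == null) {
      throw new IllegalArgumentException(driverClass + " is not annotated with @MapReduceClasses");
    }
    job.setMapperClass(a.mapper());
    job.setCombinerClass(a.combiner());
    job.setReducerClass(a.reducer());
  }
}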
Re: help me to solve Exception
11/06/04 01:47:09 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/eng-zinab/inn/In (copy) could only be replicated to 0 nodes, instead of 1 Do you have a datanode running? -- Best regards / Met vriendelijke groeten, Niels Basjes
[jira] [Reopened] (HADOOP-7305) Eclipse project files are incomplete
[ https://issues.apache.org/jira/browse/HADOOP-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niels Basjes reopened HADOOP-7305: -- Apparently there are some issues with the first version in combination with OS X. Eclipse project files are incomplete Key: HADOOP-7305 URL: https://issues.apache.org/jira/browse/HADOOP-7305 Project: Hadoop Common Issue Type: Improvement Components: build Reporter: Niels Basjes Assignee: Niels Basjes Priority: Minor Fix For: 0.22.0 Attachments: HADOOP-7305-2011-05-19.patch, HADOOP-7305-2011-05-30.patch After a fresh checkout of hadoop-common I do 'ant compile eclipse'. I open eclipse, set ANT_HOME and build the project. At that point the following error appears: {quote} The type com.sun.javadoc.RootDoc cannot be resolved. It is indirectly referenced from required .class files ExcludePrivateAnnotationsJDiffDoclet.java /common/src/java/org/apache/hadoop/classification/tools line 1 Java Problem {quote} The solution is to add the tools.jar from the JDK to the buildpath/classpath. This should be fixed in the build.xml. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Eclipse target
Hi Todd, 2011/5/19 Todd Lipcon t...@cloudera.com: Yes, I have to do this same thing manually every time I re-run ant eclipse. So, it seems like we should add it to the eclipse target, like you said. Feel free to file a JIRA and patch! https://issues.apache.org/jira/browse/HADOOP-7305 (Hadoop QA should start in a few minutes to validate this). -- Met vriendelijke groeten, Niels Basjes
Re: MapReduce compilation error
Today I ran into the same error and I was puzzled by the content of this file. What is the purpose of a test file that appears to have a deliberate error and no code whatsoever? 2011/3/19 Harsh J qwertyman...@gmail.com: This shouldn't really interfere with your development. You may try to exclude it from Eclipse's build, perhaps. On Sat, Mar 19, 2011 at 1:39 AM, bikash sharma sharmabiks...@gmail.com wrote: Hi, When I am compiling the MapReduce source code after checking it out in Eclipse, I am getting the following error: The declared package does not match the expected package testjar ClassWithNoPackage.java Hadoop-MR/src/test/mapred/testjar Any thoughts? Thanks, Bikash -- Harsh J http://harshj.com -- Met vriendelijke groeten, Niels Basjes
Eclipse target
When I check out common, run 'ant eclipse' and then open Eclipse I get this error: The type com.sun.javadoc.RootDoc cannot be resolved. It is indirectly referenced from required .class files ExcludePrivateAnnotationsJDiffDoclet.java /common/src/java/org/apache/hadoop/classification/tools line 1 Java Problem The problem is that the tools.jar is missing during the build. Now I'm an extreme novice when it comes to the ant build.xml, but I think I solved the problem by ensuring the tools.jar is added to the Eclipse project files (i.e. the .classpath file). My question to you all: is this the correct solution? Is it worth submitting as a patch?

diff --git build.xml build.xml
index 26ccfa0..168b34f 100644
--- build.xml
+++ build.xml
@@ -1571,6 +1571,7 @@
       <library pathref="ivy-test.classpath" exported="false" />
       <variable path="ANT_HOME/lib/ant.jar" exported="false" />
       <library path="${conf.dir}" exported="false" />
+      <library path="${java.home}/../lib/tools.jar" exported="false" />
     </classpath>
   </eclipse>
 </target>

-- Met vriendelijke groeten, Niels Basjes
Report as a bug?
I was playing around with PMD, just to see what kind of messages it gives on my Hadoop feature. I noticed a message about dead code in org.apache.hadoop.fs.ftp.FTPFileSystem. Starting at about line 80:

String userAndPassword = uri.getUserInfo();
if (userAndPassword == null) {
  userAndPassword = (conf.get("fs.ftp.user." + host, null) + ":" + conf
      .get("fs.ftp.password." + host, null));
  if (userAndPassword == null) {
    throw new IOException("Invalid user/passsword specified");
  }
}

The last if block is the dead code, as the string will always contain at least the text ":" or "null:null". It will probably fail a bit later when really trying to log in with a wrong uid/password. So, is this worth reporting as a bug? -- Met vriendelijke groeten, Niels Basjes
[jira] Created: (HADOOP-7127) Bug in login error handling in org.apache.hadoop.fs.ftp.FTPFileSystem
Bug in login error handling in org.apache.hadoop.fs.ftp.FTPFileSystem - Key: HADOOP-7127 URL: https://issues.apache.org/jira/browse/HADOOP-7127 Project: Hadoop Common Issue Type: Bug Components: fs Reporter: Niels Basjes I was playing around with PMD, just to see what kind of messages it gives on Hadoop. I noticed a message about dead code in org.apache.hadoop.fs.ftp.FTPFileSystem. Starting at about line 80:

String userAndPassword = uri.getUserInfo();
if (userAndPassword == null) {
  userAndPassword = (conf.get("fs.ftp.user." + host, null) + ":" + conf
      .get("fs.ftp.password." + host, null));
  if (userAndPassword == null) {
    throw new IOException("Invalid user/passsword specified");
  }
}

The last if block is the dead code, as the string will always contain at least the text ":" or "null:null". This means that the error handling fails to work as intended. It will probably fail a bit later when really trying to log in with a wrong uid/password. P.S. Fix the silly typo passsword in the exception message too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
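A hedged sketch of what the dead check presumably intended (not the committed fix for HADOOP-7127): look up user and password separately, so a missing value is actually detected before the two strings are concatenated.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;

final class FtpLoginHelper {
  /**
   * userFromUri is what uri.getUserInfo() returned; host identifies the FTP server.
   * Returns "user:password", or throws if either half is missing.
   */
  static String userAndPassword(Configuration conf, String host, String userFromUri)
      throws IOException {
    if (userFromUri != null) {
      return userFromUri;                       // credentials were embedded in the URI
    }
    String user = conf.get("fs.ftp.user." + host);
    String password = conf.get("fs.ftp.password." + host);
    if (user == null || password == null) {
      throw new IOException("Invalid user/password specified");
    }
    return user + ":" + password;
  }
}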
Re: [Hadoop Wiki] Update of FrontPage by prosch
Seems like a good moment to blacklist the prosch user from ever changing the wiki again. 2011/1/10 Apache Wiki wikidi...@apache.org: Dear Wiki user, You have subscribed to a wiki page or wiki category on Hadoop Wiki for change notification. The FrontPage page has been changed by prosch. http://wiki.apache.org/hadoop/FrontPage?action=diff&rev1=175&rev2=176 -- == General Information == * [[http://hadoop.apache.org/|Official Apache Hadoop Website]]: download, bug-tracking, mailing-lists, etc. * [[ProjectDescription|Overview]] of Apache Hadoop - * [[FAQ]] + * [[FAQ]] [[http://www.profi-fachuebersetzung.de/language-translation.html|Translation agency]] / [[http://www.profischnell.com|Übersetzung Polnisch Deutsch]] * [[HadoopIsNot|What Hadoop is not]] * [[Distributions and Commercial Support|Distributions and Commercial Support]] for Hadoop (RPMs, Debs, AMIs, etc) * [[HadoopPresentations|Presentations]], [[Books|books]], [[HadoopArticles|articles]] and [[Papers|papers]] about Hadoop * PoweredBy, a list of sites and applications powered by Apache Hadoop * Support * [[Help|Getting help from the hadoop community]]. - * [[Support|People and companies for hire]]. + * [[Support|People and companies for hire]]. * [[Conferences|Hadoop Community Events and Conferences]] * HadoopUserGroups (HUGs) * HadoopSummit -- Met vriendelijke groeten, Niels Basjes
Ready for review?
Hi, I consider the patch I created to be ready for review by a code reviewer. It does what I want it to do and Hudson gives an overall +1. The http://wiki.apache.org/hadoop/HowToContribute page was unclear to me on what to do next. So, what should I do? Simply wait / change the state of the issue / ... https://issues.apache.org/jira/browse/HADOOP-7076 -- Met vriendelijke groeten, Niels Basjes
Re: Ready for review?
Hi Jakob, Thanks for clarifying this to me. Niels 2011/1/7 Jakob Homan jgho...@gmail.com: Niels- Thanks for the contribution. For the moment your task is done. Now it's up to a committer to review the patch and either provide you with feedback for its improvement or commit it. It's in the patch available state, which is the flag for reviewers to know there's work for them to do. Since this is a volunteer effort, I'm afraid there's no firm timeline for when this will get done. -Jakob On Fri, Jan 7, 2011 at 6:50 AM, Niels Basjes ni...@basjes.nl wrote: Hi, I consider the patch I created to be ready for review by a code reviewer. It does what I want it to do and Hudson gives an overall +1. The http://wiki.apache.org/hadoop/HowToContribute page was unclear to me on what to do next. So, what should I do? Simply wait / change the state of the issue / ... https://issues.apache.org/jira/browse/HADOOP-7076 -- Met vriendelijke groeten, Niels Basjes -- Met vriendelijke groeten, Niels Basjes
Jira workflow problem.
I seem to have selected the wrong option in Jira to get the latest patch handled. For some reason the option to indicate that a new patch has been made available is no longer present. https://issues.apache.org/jira/browse/HADOOP-7076 What did I do wrong and what can I do to fix this? Thanks. -- Met vriendelijke groeten, Niels Basjes
Re: Jira workflow problem.
Thanks guys, I really appreciate your help. Niels Basjes 2011/1/6 Doug Cutting cutt...@apache.org: The problem was that the submitter could transition from Patch Available to In Progress, following the Resume Progress transition, but the submitter cannot then transition anywhere from In Progress; only the assignee could. I fixed that, so that the assignee can no longer follow that transition into a cul-de-sac. I also made you a contributor so you could be assigned issues, and assigned you this issue. Doug On 01/06/2011 01:41 PM, Niels Basjes wrote: I seem to have selected the wrong option in Jira to get the latest patch handled. For some reason the option to indicate that a new patch has been made available is no longer present. https://issues.apache.org/jira/browse/HADOOP-7076 What did I do wrong and what can I do to fix this? Thanks. -- Met vriendelijke groeten, Niels Basjes
Build failed, hudson broken?
Hi, I just submitted a patch for the feature I've been working on. https://issues.apache.org/jira/browse/HADOOP-7076 This patch works fine on my system and passes all the unit tests. Now, some 30 minutes later, it seems the build on Hudson has failed. https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/155/ I'm not sure, but to me it seems that there are issues with Hudson itself. None of the errors in the log are related to my fixes, and not only my build (155) but also builds 154 and 153 have failed with errors that are (at first glance) the same. Does someone here know how/where to get this problem fixed? -- Met vriendelijke groeten, Niels Basjes
Re: Build failed, hudson broken?
I found where to report this ... so I did: https://issues.apache.org/jira/browse/INFRA-3340 2011/1/5 Niels Basjes ni...@basjes.nl: Hi, I just submitted a patch for the feature I've been working on. https://issues.apache.org/jira/browse/HADOOP-7076 This patch works fine on my system and passes all the unit tests. Now, some 30 minutes later, it seems the build on Hudson has failed. https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/155/ I'm not sure, but to me it seems that there are issues with Hudson itself. None of the errors in the log are related to my fixes, and not only my build (155) but also builds 154 and 153 have failed with errors that are (at first glance) the same. Does someone here know how/where to get this problem fixed? -- Met vriendelijke groeten, Niels Basjes -- Met vriendelijke groeten, Niels Basjes
What is the correct spelling?
Hi, I noticed that the isSplitable method (and a bunch of other places in the Hadoop codebase) writes 'splitable' where I would have expected 'splittable' (two 't's instead of one). All the spell checkers I have indicate that the 'double t' version is correct. Should I correct this in the JUnit test files I'm touching? -- Best regards, Niels Basjes
[jira] Created: (HADOOP-7076) Splittable Gzip
Splittable Gzip --- Key: HADOOP-7076 URL: https://issues.apache.org/jira/browse/HADOOP-7076 Project: Hadoop Common Issue Type: New Feature Components: io Reporter: Niels Basjes Files compressed with the gzip codec are not splittable due to the nature of the codec. This limits the options you have for scaling out when reading large gzipped input files. Given the fact that gunzipping a 1GiB file usually takes only 2 minutes, I figured that for some use cases wasting some resources may result in a shorter job time under certain conditions. So reading the entire input file from the start for each split (wasting resources!!) may lead to additional scalability. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
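For completeness, a hedged usage sketch of one plausible way such a codec would be switched on: it goes into the normal io.compression.codecs list like any other codec (the codec class name below is illustrative only, not the name used in the HADOOP-7076 patch):

import org.apache.hadoop.conf.Configuration;

final class EnableSplittableGzip {
  static Configuration configure() {
    Configuration conf = new Configuration();
    // io.compression.codecs is the standard comma-separated list of codec classes
    // that CompressionCodecFactory uses to match files by their extension.
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec,"
      + "org.apache.hadoop.io.compress.GzipCodec,"
      + "org.example.compress.SplittableGzipCodec");   // illustrative class name
    return conf;
  }
}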