Re: Checkstyle 80 char limit

2015-05-05 Thread Niels Basjes
Perhaps we should tell him these screens can also be turned landscape ? ;)

But seriously:
1) Does Doug still actively work on code? From my perspective: only very
infrequently.
2) In the 200+ person IT department where I work, I know only 1 colleague
who uses his screen in portrait mode, and he doesn't write code.

So, for whom do we really stick to the 80-character limit?

Niels

On Tue, May 5, 2015 at 8:11 PM, Allen Wittenauer a...@altiscale.com wrote:


 On May 5, 2015, at 11:05 AM, Rich Haase rha...@pandora.com wrote:

  Can someone explain to me why on earth we care about limiting line
 length to 80 characters?  Are there hadoop developers out there working
 from teletype terminals?  Can we perhaps update this limit to something
 sane, like 120 chars?
 



 http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201407.mbox/%3CCALEq1Z8QvHof1A3zO0W5WGfbNjCOpfNo==jktq8jiu6efm_...@mail.gmail.com%3E





-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Checkstyle 80 char limit

2015-05-05 Thread Niels Basjes
Quote from that page (i.e. the big block in RED at the top):

The information on this page is for Archive Purposes Only
This page is not being actively maintained.
Links within the documentation may not work and the information itself may
no longer be valid.
The last revision to this document was made on April 20, 1999


So the 80-character rule was, at best, last reconsidered 16 years ago.

Things have changed ...

Niels Basjes


On Tue, May 5, 2015 at 8:21 PM, Jonathan Eagles jeag...@gmail.com wrote:

 More formally, we follow the Sun Java code conventions, which specify
 the 80 character limit. There has been recent discussion about this, so it
 is a very relevant comment.

 http://wiki.apache.org/hadoop/HowToContribute

 http://www.oracle.com/technetwork/java/javase/documentation/codeconventions-136091.html#313

 On Tue, May 5, 2015 at 1:11 PM, Allen Wittenauer a...@altiscale.com wrote:
 
  On May 5, 2015, at 11:05 AM, Rich Haase rha...@pandora.com wrote:
 
  Can someone explain to me why on earth we care about limiting line
 length to 80 characters?  Are there hadoop developers out there working
 from teletype terminals?  Can we perhaps update this limit to something
 sane, like 120 chars?
 
 
 
 
 http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201407.mbox/%3CCALEq1Z8QvHof1A3zO0W5WGfbNjCOpfNo==jktq8jiu6efm_...@mail.gmail.com%3E
 
 




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Checkstyle 80 char limit

2015-05-05 Thread Niels Basjes
 I respect immensely the contributions Doug has made to many Apache
projects,

Hear hear! I fully agree with that one.

On Tue, May 5, 2015 at 8:30 PM, Rich Haase rha...@pandora.com wrote:

 I respect immensely the contributions Doug has made to many Apache
 projects, but I don’t think that should be a reason to force everyone to
 write code as if our screen sizes can’t support more than 80 characters.

 On May 5, 2015, at 12:21 PM, Niels Basjes ni...@basjes.nl wrote:

 Perhaps we should tell him these screens can also be turned landscape ? ;)

 But seriously:
 1) Does Doug still actively work on code? From my perspective; Only very
 infrequently.
 2) In the 200+ people IT department where I work I know only 1 colleague
 who uses his screen in portrait mode and he doesn't do code.

 So, for who do we really stick to the 80 chars limit?

 Niels

 On Tue, May 5, 2015 at 8:11 PM, Allen Wittenauer a...@altiscale.com wrote:


 On May 5, 2015, at 11:05 AM, Rich Haase rha...@pandora.com wrote:

 Can someone explain to me why on earth we care about limiting line
 length to 80 characters?  Are there hadoop developers out there working
 from teletype terminals?  Can we perhaps update this limit to something
 sane, like 120 chars?





 http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201407.mbox/%3CCALEq1Z8QvHof1A3zO0W5WGfbNjCOpfNo==jktq8jiu6efm_...@mail.gmail.com%3E





 --
 Best regards / Met vriendelijke groeten,

 Niels Basjes

 Rich Haase| Sr. Software Engineer | Pandora
 m (303) 887-1146 | rha...@pandora.com






-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: IMPORTANT: testing patches for branches

2015-04-22 Thread Niels Basjes
Perhaps a script is in order that creates the patch file with -exactly- the
right name.
Something like what HBase has in dev-support/make_patch.sh, perhaps?

On Wed, Apr 22, 2015 at 10:30 PM, Allen Wittenauer a...@altiscale.com wrote:


 Oh, this is also in the release notes, but one can use a git reference #
 as well. :) (with kudos to OOM for the idea.)

 On Apr 22, 2015, at 8:57 PM, Allen Wittenauer a...@altiscale.com wrote:

 
  More than likely. It probably needs more testing (esp under Jenkins).
 
  It should be noted that the code in test-patch.sh has lots of problems
 with branch-0, minor, and micro releases.  But for major releases, it seems
 to work well for me. :)
 
  On Apr 22, 2015, at 8:45 PM, Vinod Kumar Vavilapalli 
 vino...@hortonworks.com wrote:
 
  Does this mean HADOOP-7435 is no longer needed / closeable as dup?
 
  Thanks
  +Vinod
 
  On Apr 22, 2015, at 12:34 PM, Allen Wittenauer a...@altiscale.com
 wrote:
 
 
  Hey gang,
 
  Just so everyone is aware, if you are working on a patch for
 either a feature branch or a major branch, if you name the patch with the
 branch name following the spec in HowToContribute (and a few other ways…
 test-patch tries to figure it out!), test-patch.sh *should* be switching
 the repo over to that branch for testing.
 
  For example,  naming a patch foo-branch-2.01.patch should get
 tested on branch-2.  Naming a patch foo-HDFS-7285.00.patch should get
 tested on the HDFS-7285 branch.
 
  This hopefully means that there should really be no more ‘blind’
 +1’s to patches that go to branches.  The “we only test against trunk”
 argument is no longer valid. :)
 
 
 
 




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


[jira] [Created] (HADOOP-11843) Make setting up the build environment easier

2015-04-17 Thread Niels Basjes (JIRA)
Niels Basjes created HADOOP-11843:
-

 Summary: Make setting up the build environment easier
 Key: HADOOP-11843
 URL: https://issues.apache.org/jira/browse/HADOOP-11843
 Project: Hadoop Common
  Issue Type: New Feature
Reporter: Niels Basjes
Assignee: Niels Basjes


( As discussed with [~aw] )
In AVRO-1537 a Docker-based solution was created to set up all the tools for
doing a full build. This makes it much easier to reproduce issues and to get
new developers up and running.

This issue is to 'copy/port' that setup into the Hadoop project in preparation
for the bug squash.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: NFSv3 Filesystem Connector

2015-01-14 Thread Niels Basjes
Hi,

The main reason Hadoop scales so well is that all components try to
adhere to the idea of data locality.
In general this means that you run the processing/query software on
the systems where the actual data is already present on the local disks.

To me this NFS solution sounds like hooking the processing nodes to a
shared storage solution.
This may work for small clusters (say 5 nodes or so), but for large clusters
this shared storage will be the main bottleneck in processing/query
speed.

We currently have more than 20 nodes with 12 hard disks each, resulting in
over 50GB/sec [1] of disk-to-query-engine bandwidth, which means that our
setup already goes much faster than any network connection to any NFS
solution can provide. We can simply go to, say, 50 nodes and easily exceed
100GB/sec.

So to me this sounds like hooking a scalable processing platform to a
non-scalable storage system (mainly because the network to this storage
doesn't scale).

So far I have only seen vendors of legacy storage solutions going in this
direction ... oh wait ... you are NetApp ... that explains it.

I am not a committer on any of the Hadoop tools, but I vote against having
such a core-concept-breaking piece in the main codebase. New people may start
to think it is a good idea to do this.

So I say you should simply make this plugin available to your customers,
just not as a core part of Hadoop.

Niels Basjes

[1]  50 GB/sec = approx. 20 nodes * 12 disks * 200MB/sec
  This page shows maximum read speeds in the 200MB/sec range:

http://www.tomshardware.com/charts/enterprise-hdd-charts/-02-Read-Throughput-Maximum-h2benchw-3.16,3372.html


On Tue, Jan 13, 2015 at 10:35 PM, Gokul Soundararajan 
gokulsoun...@gmail.com wrote:

 Hi,

 We (Jingxin Feng, Xing Lin, and I) have been working on providing a
 FileSystem implementation that allows Hadoop to utilize an NFSv3 storage
 server as a filesystem. It leverages code from the hadoop-nfs project for all
 the request/response handling. We would like your help to add it as part of
 hadoop-tools (similar to the way hadoop-aws and hadoop-azure are included).

 In more detail, the Hadoop NFS Connector allows Apache Hadoop (2.2+) and
 Apache Spark (1.2+) to use a NFSv3 storage server as a storage endpoint.
 The NFS Connector can be run in two modes: (1) secondary filesystem - where
 Hadoop/Spark runs using HDFS as its primary storage and can use NFS as a
 second storage endpoint, and (2) primary filesystem - where Hadoop/Spark
 runs entirely on a NFSv3 storage server.

 The code is written in a way such that existing applications do not have to
 change. All one has to do is to copy the connector jar into the lib/
 directory of Hadoop/Spark. Then, modify core-site.xml to provide the
 necessary details.

 The current version can be seen at:
 https://github.com/NetApp/NetApp-Hadoop-NFS-Connector

 It is my first time contributing to the Hadoop codebase. It would be great
 if someone on the Hadoop team can guide us through this process. I'm
 willing to make the necessary changes to integrate the code. What are the
 next steps? Should I create a JIRA entry?

 Thanks,

 Gokul




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Updates on migration to git

2014-08-26 Thread Niels Basjes
Hi,

Great to see the move towards git.

In terms of documentation, could you please include how binary files
should be included in a patch (see this discussion:
https://www.mail-archive.com/common-dev%40hadoop.apache.org/msg13166.html )
and also update http://wiki.apache.org/hadoop/GitAndHadoop (this one still
talks about the time when there were 3 separate projects).

Thanks.

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Deprecated configuration settings set from the core code / {core,hdfs,...}-default.xml ??

2014-08-21 Thread Niels Basjes
Hi,

I found this because I was wondering why simply starting something as
trivial as the Pig grunt shell gives the following messages during startup:

2014-08-21 09:36:55,171 [main] INFO
 org.apache.hadoop.conf.Configuration.deprecation - *mapred.job.tracker is
deprecated*. Instead, use mapreduce.jobtracker.address
2014-08-21 09:36:55,172 [main] INFO
 org.apache.hadoop.conf.Configuration.deprecation - *fs.default.name is
deprecated*. Instead, use fs.defaultFS

What I found is that these settings are not part of my config but they are
part of the 'core hadoop' files.

I found that mapred.job.tracker is set from code when using the mapred
package (which is probably what Pig uses)
https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/JobClient.java#L869

and that the fs.default.name is explicitly defined here as 'deprecated' in
one of the *-default.xml config files.
https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/resources/core-default.xml#L524

I did some more digging and found several other properties that have been
defined as deprecated but are still present in the various *-default.xml
files throughout the Hadoop code base.

I used this list as a reference:
https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/site/apt/DeprecatedProperties.apt.vm

The ones I found so far:
./hadoop-common-project/hadoop-common/src/main/resources/core-default.xml:
 <name>fs.default.name</name>
./hadoop-common-project/hadoop-common/src/main/resources/core-default.xml:
 <name>io.bytes.per.checksum</name>
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml:
 <name>mapreduce.job.counters.limit</name>
./hadoop-tools/hadoop-distcp/src/main/resources/distcp-default.xml:
 <name>mapred.job.map.memory.mb</name>
./hadoop-tools/hadoop-distcp/src/main/resources/distcp-default.xml:
 <name>mapred.job.reduce.memory.mb</name>
./hadoop-tools/hadoop-distcp/src/main/resources/distcp-default.xml:
 <name>mapreduce.reduce.class</name>

It seems to me that fixing these would remove a lot of senseless clutter from
the console messages that end users see.
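
For anyone who wants to see this quickly: merely touching the old key through
the Configuration API is enough to trigger the log line quoted above. A
minimal sketch (the class name is made up):

import org.apache.hadoop.conf.Configuration;

public class DeprecationDemo {
  public static void main(String[] args) {
    // Loads core-default.xml and core-site.xml from the classpath.
    Configuration conf = new Configuration();

    // Accessing the old key logs:
    //   "fs.default.name is deprecated. Instead, use fs.defaultFS"
    // even when your own *-site.xml files never mention the old name.
    String fs = conf.get("fs.default.name");
    System.out.println("fs.default.name resolves to: " + fs);
  }
}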

Or is there a good reason to keep it like this?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Hortonworks scripting ...

2014-08-14 Thread Niels Basjes
Hi,

With core Hadoop you can have multiple clusters available on your (desktop)
client simply by having multiple directories with configuration files (i.e.
core-site.xml etc.) and selecting the one you want by changing the
environment settings (i.e. HADOOP_CONF_DIR and such).

This doesn't work when I run under the Hortonworks 2.1.2 distribution.

There I find that all of the scripts placed in /usr/bin/ muck about with the
environment settings: things from /etc/default are sourced and they override
my settings.
I can control part of this by pointing BIGTOP_DEFAULTS_DIR at a blank
directory, but /usr/bin/pig has the sourcing of /etc/default/hadoop hardcoded
into the script.

Why is this done this way?

P.S. Where is the git(?) repo located where this (apparently HW-specific)
scripting is maintained?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Change proposal for FileInputFormat isSplitable

2014-07-30 Thread Niels Basjes
Hi,

I talked to some people and they agreed with me that the situation where this
problem really occurs is when someone builds a FileInputFormat derivative
that also uses a LineRecordReader derivative. This is exactly the scenario
that occurs if someone follows the Yahoo Hadoop tutorial.

Instead of changing the FileInputFormat (which many of the committers
considered to be a bad idea) I created a very simple patch for the
LineRecordReader that throws an exception (intentionally failing the entire
job) when it receives a split for a compressed file that was not compressed
using a SplittableCompressionCodec and where the split does not start at the
beginning of the file. In short: fail if it detects a non-splittable file
that has been split.

So if you run this against a 1GB gzipped file, the first split of the whole
file will complete successfully and all other splits will fail without even
reading a single line.
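
For reference, the check boils down to something like the sketch below (an
illustration of the idea only, not the literal patch; the class and method
names are made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Illustration of the split sanity check described above. */
public final class SplitSanityCheck {

  /** Fail hard when a non-splittable compressed file was split anyway. */
  public static void verify(FileSplit split, Configuration conf) throws IOException {
    Path file = split.getPath();
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
    boolean splittable = (codec == null) || (codec instanceof SplittableCompressionCodec);
    if (!splittable && split.getStart() != 0) {
      // Only the split that starts at offset 0 can be read correctly; any
      // other split would silently re-read the file from the start.
      throw new IOException("Split of " + file + " starts at offset "
          + split.getStart() + " but codec " + codec.getClass().getName()
          + " is not splittable; the InputFormat that created this split is broken.");
    }
  }

  private SplitSanityCheck() {
  }
}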

As far as I can tell this is a simple, clean and compatible patch that does
not break anything. The change is also limited to the most common place
where this problem occurs.
The only 'big' effect is that people who have been running a broken
implementation will no longer be able to run this broken code iff they feed
it 'large' non-splittable files. Which I think is a good thing.

What do you (the committers) think of this approach?

The patch I submitted a few days ago also includes the JavaDoc improvements
(in FileInputFormat) provided by Gian Merlino

https://issues.apache.org/jira/browse/MAPREDUCE-2094

Niels Basjes

P.S. I still think that FileInputFormat.isSplitable() should implement
a safe default instead of an optimistic default.



On Sat, Jun 14, 2014 at 10:33 AM, Niels Basjes ni...@basjes.nl wrote:

 I did some digging through the code base and inspected all the situations
 I know of where this goes wrong (including the Yahoo tutorial) and found a
 place that may be a spot to avoid the effects of this problem (instead of
 solving the cause of the problem).

 It turns out that all of those use cases use the LineRecordReader to read
 the data. This class (both the mapred and mapreduce versions) have the
 notion of the split that needs to be read, if the file is compressed and if
 this is a splittable compression codec.

 Now if we were to add code there that validates if the provided splits are
 valid or not (i.e. did the developer make this bug or not) then we could
 avoid the garbage data problem before it is fed into the actual mapper.
 This must  then write error messages (+ message did you know you have been
 looking at corrupted data for a long time) that will appear in the logs of
 all the mapper attempts.

 At that point we can do one of these two actions in the LineRecordReader:
 - Fail hard with an exception. The job fails and the user immediately goes
 to the developer of the inputformat with a bug report.
 - Avoid the problem: Read the entire file iff the start of the split is 0,
 else read nothing. Many users will see a dramatic change in their results
 and (hopefully) start digging deeper. (Iff a human actually looks at the
 data)

 I vote for the fail hard because then people are forced to fix the
 problem and correct the historical impact.

 Would this be a good / compatible solution?

 If so then I think we should have this in both the 2.x and 3.x.

 For the 3.x I also realized that perhaps the isSplittable is something
 that could be delegated to the record reader. Would that make sense or is
 this something that does not belong there?
 If not then I would still propose making the isSplittable abstract to fix
 the problem before it is created (in 3.x)

 Niels Basjes
 On Jun 13, 2014 11:47 PM, Chris Douglas cdoug...@apache.org wrote:

 On Fri, Jun 13, 2014 at 2:54 AM, Niels Basjes ni...@basjes.nl wrote:
  Hmmm, people only look at logs when they have a problem. So I don't
 think
  this would be enough.

 This change to the framework will cause disruptions to users, to aid
 InputFormat authors' debugging. The latter is a much smaller
 population and better equipped to handle this complexity.

 A log statement would print during submission, so it would be visible
 to users. If a user's job is producing garbage but submission was
 non-interactive, a log statement would be sufficient to debug the
 issue. If the naming conflict is common in some contexts, the warning
 can be disabled using the log configuration.

 Beyond that, input validation is the responsibility of the InputFormat
 author.

  Perhaps this makes sense:
  - For 3.0: Shout at the developer who does it wrong (i.e. make it
 abstract
  and force them to think about this) i.e. Create new abstract method
  isSplittable (tt) in FileInputFormat, remove isSplitable (one t).
 
  To avoid needless code duplication (which we already have in the
 codebase)
  create a helper method something like 'fileNameIndicatesSplittableFile'
 (
  returns enum:  Splittable/NonSplittable/Unknown ).
 
  - For 2.x: Keep the enduser safe

Re: Jenkins problem or patch problem?

2014-07-29 Thread Niels Basjes
I think this behavior is better.
This way you know your patch was not (fully) applied.

It would be even better if there were a way to submit a patch with a binary
file in it.

Niels


On Mon, Jul 28, 2014 at 11:29 PM, Andrew Wang andrew.w...@cloudera.com
wrote:

 I had the same issue on HDFS-6696, patch generated with git diff
 --binary. I ended up making the same patch without the binary part and it
 could be applied okay.

 This does differ in behavior from the old boxes, which were still able to
 apply the non-binary parts of a binary-diff.


 On Mon, Jul 28, 2014 at 3:06 AM, Niels Basjes ni...@basjes.nl wrote:

   For my test case I needed a something.txt.gz file.
   However, for this specific test the file will never actually be read; it
   just has to be there and it must be a few bytes in size.
   Because binary files don't work, I simply created a file containing "Hello
   world".
   Now this isn't a gzip file at all, yet for my test it does enough to make
   the test work as intended.
 
  So in fact I didn't solve the binary attachment problem at all.
 
 
  On Mon, Jul 28, 2014 at 1:40 AM, Ted Yu yuzhih...@gmail.com wrote:
 
   Mind telling us how you included the binary file in your svn patch ?
  
   Thanks
  
  
   On Sun, Jul 27, 2014 at 12:27 PM, Niels Basjes ni...@basjes.nl
 wrote:
  
I created a patch file with SVN and it works now.
I dare to ask: Are there any git created patch files that work?
   
   
On Sun, Jul 27, 2014 at 9:44 PM, Niels Basjes ni...@basjes.nl
 wrote:
   
 I'll look for a workaround regarding the binary file. Thanks.


 On Sun, Jul 27, 2014 at 9:07 PM, Ted Yu yuzhih...@gmail.com
 wrote:

 Similar problem has been observed for HBase patches.

 Have you tried attaching level 1 patch ?
 For the binary file, to my knowledge, 'git apply' is able to
 handle
  it
but
 hadoop is currently using svn.

 Cheers


 On Sun, Jul 27, 2014 at 11:01 AM, Niels Basjes ni...@basjes.nl
   wrote:

  Hi,
 
  I just submitted a patch and Jenkins said it failed to apply the
patch.
  But when I look at the console output
 
 
   https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console
 
  it says:
 
   At revision 1613826.
   MAPREDUCE-2094 patch is being downloaded at Sun Jul 27 18:50:44 UTC 2014 from
   http://issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch
   *cp: cannot stat '/home/jenkins/buildSupport/lib/*': No such file or directory*
   The patch does not appear to apply with p0 to p2
   PATCH APPLICATION FAILED
 
 
  Now I do have a binary file (for the unit test) in this patch,
perhaps I
  did something wrong? Or is this problem caused by the error I
 highlighted?
 
  What can I do to fix this?
 
  --
  Best regards / Met vriendelijke groeten,
 
  Niels Basjes
 




 --
 Best regards / Met vriendelijke groeten,

 Niels Basjes

   
   
   
--
Best regards / Met vriendelijke groeten,
   
Niels Basjes
   
  
 
 
 
  --
  Best regards / Met vriendelijke groeten,
 
  Niels Basjes
 




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Jenkins problem or patch problem?

2014-07-28 Thread Niels Basjes
For my test case I needed a something.txt.gz file.
However, for this specific test the file will never actually be read; it
just has to be there and it must be a few bytes in size.
Because binary files don't work, I simply created a file containing "Hello
world".
Now this isn't a gzip file at all, yet for my test it does enough to make
the test work as intended.

So in fact I didn't solve the binary attachment problem at all.


On Mon, Jul 28, 2014 at 1:40 AM, Ted Yu yuzhih...@gmail.com wrote:

 Mind telling us how you included the binary file in your svn patch ?

 Thanks


 On Sun, Jul 27, 2014 at 12:27 PM, Niels Basjes ni...@basjes.nl wrote:

  I created a patch file with SVN and it works now.
  I dare to ask: Are there any git created patch files that work?
 
 
  On Sun, Jul 27, 2014 at 9:44 PM, Niels Basjes ni...@basjes.nl wrote:
 
   I'll look for a workaround regarding the binary file. Thanks.
  
  
   On Sun, Jul 27, 2014 at 9:07 PM, Ted Yu yuzhih...@gmail.com wrote:
  
   Similar problem has been observed for HBase patches.
  
   Have you tried attaching level 1 patch ?
   For the binary file, to my knowledge, 'git apply' is able to handle it
  but
   hadoop is currently using svn.
  
   Cheers
  
  
   On Sun, Jul 27, 2014 at 11:01 AM, Niels Basjes ni...@basjes.nl
 wrote:
  
Hi,
   
I just submitted a patch and Jenkins said it failed to apply the
  patch.
But when I look at the console output
   
   
 https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console
   
it says:
   
 At revision 1613826.
 MAPREDUCE-2094 patch is being downloaded at Sun Jul 27 18:50:44 UTC 2014 from
 http://issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch
 *cp: cannot stat '/home/jenkins/buildSupport/lib/*': No such file or directory*
 The patch does not appear to apply with p0 to p2
 PATCH APPLICATION FAILED
   
   
Now I do have a binary file (for the unit test) in this patch,
  perhaps I
did something wrong? Or is this problem caused by the error I
   highlighted?
   
What can I do to fix this?
   
--
Best regards / Met vriendelijke groeten,
   
Niels Basjes
   
  
  
  
  
   --
   Best regards / Met vriendelijke groeten,
  
   Niels Basjes
  
 
 
 
  --
  Best regards / Met vriendelijke groeten,
 
  Niels Basjes
 




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Jenkins problem or patch problem?

2014-07-27 Thread Niels Basjes
Hi,

I just submitted a patch and Jenkins said it failed to apply the patch.
But when I look at the console output

https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console

it says:

At revision 1613826.
MAPREDUCE-2094 patch is being downloaded at Sun Jul 27 18:50:44 UTC 2014 from
http://issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch
*cp: cannot stat '/home/jenkins/buildSupport/lib/*': No such file or directory*
The patch does not appear to apply with p0 to p2
PATCH APPLICATION FAILED


Now I do have a binary file (for the unit test) in this patch, perhaps I
did something wrong? Or is this problem caused by the error I highlighted?

What can I do to fix this?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Jenkins problem or patch problem?

2014-07-27 Thread Niels Basjes
There are several other jobs (completely unrelated to my patch) that failed
with exactly the same error.
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4770/console
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4769/console
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4768/console

So I say Jenkins problem for now.


On Sun, Jul 27, 2014 at 9:01 PM, Niels Basjes ni...@basjes.nl wrote:

 Hi,

 I just submitted a patch and Jenkins said it failed to apply the patch.
 But when I look at the console output

 https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console

 it says:

 At revision 1613826.
 MAPREDUCE-2094 patch is being downloaded at Sun Jul 27 18:50:44 UTC 2014 from
 http://issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch
 *cp: cannot stat '/home/jenkins/buildSupport/lib/*': No such file or directory*
 The patch does not appear to apply with p0 to p2
 PATCH APPLICATION FAILED


 Now I do have a binary file (for the unit test) in this patch, perhaps I
 did something wrong? Or is this problem caused by the error I highlighted?

 What can I do to fix this?

 --
 Best regards / Met vriendelijke groeten,

 Niels Basjes




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Jenkins problem or patch problem?

2014-07-27 Thread Niels Basjes
I'll look for a workaround regarding the binary file. Thanks.


On Sun, Jul 27, 2014 at 9:07 PM, Ted Yu yuzhih...@gmail.com wrote:

 Similar problem has been observed for HBase patches.

 Have you tried attaching a level 1 patch?
 For the binary file, to my knowledge, 'git apply' is able to handle it, but
 Hadoop is currently using svn.

 Cheers


 On Sun, Jul 27, 2014 at 11:01 AM, Niels Basjes ni...@basjes.nl wrote:

  Hi,
 
  I just submitted a patch and Jenkins said it failed to apply the patch.
  But when I look at the console output
 
  https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console
 
  it says:
 
   At revision 1613826.
   MAPREDUCE-2094 patch is being downloaded at Sun Jul 27 18:50:44 UTC 2014 from
   http://issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch
   *cp: cannot stat '/home/jenkins/buildSupport/lib/*': No such file or directory*
   The patch does not appear to apply with p0 to p2
   PATCH APPLICATION FAILED
 
 
  Now I do have a binary file (for the unit test) in this patch, perhaps I
  did something wrong? Or is this problem caused by the error I
 highlighted?
 
  What can I do to fix this?
 
  --
  Best regards / Met vriendelijke groeten,
 
  Niels Basjes
 




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Jenkins problem or patch problem?

2014-07-27 Thread Niels Basjes
I created a patch file with SVN and it works now.
I dare to ask: are there any git-created patch files that work?


On Sun, Jul 27, 2014 at 9:44 PM, Niels Basjes ni...@basjes.nl wrote:

 I'll look for a workaround regarding the binary file. Thanks.


 On Sun, Jul 27, 2014 at 9:07 PM, Ted Yu yuzhih...@gmail.com wrote:

 Similar problem has been observed for HBase patches.

 Have you tried attaching level 1 patch ?
 For the binary file, to my knowledge, 'git apply' is able to handle it but
 hadoop is currently using svn.

 Cheers


 On Sun, Jul 27, 2014 at 11:01 AM, Niels Basjes ni...@basjes.nl wrote:

  Hi,
 
  I just submitted a patch and Jenkins said it failed to apply the patch.
  But when I look at the console output
 
  https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4771//console
 
  it says:
 
   At revision 1613826.
   MAPREDUCE-2094 patch is being downloaded at Sun Jul 27 18:50:44 UTC 2014 from
   http://issues.apache.org/jira/secure/attachment/12658034/MAPREDUCE-2094-20140727.patch
   *cp: cannot stat '/home/jenkins/buildSupport/lib/*': No such file or directory*
   The patch does not appear to apply with p0 to p2
   PATCH APPLICATION FAILED
 
 
  Now I do have a binary file (for the unit test) in this patch, perhaps I
  did something wrong? Or is this problem caused by the error I
 highlighted?
 
  What can I do to fix this?
 
  --
  Best regards / Met vriendelijke groeten,
 
  Niels Basjes
 




 --
 Best regards / Met vriendelijke groeten,

 Niels Basjes




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Change proposal for FileInputFormat isSplitable

2014-06-13 Thread Niels Basjes
Hi,

On Wed, Jun 11, 2014 at 8:25 PM, Chris Douglas cdoug...@apache.org wrote:

 On Wed, Jun 11, 2014 at 1:35 AM, Niels Basjes ni...@basjes.nl wrote:
  That's not what I meant. What I understood from what was described is
 that
  sometimes people use an existing file extension (like .gz) for a file
 that
  is not a gzipped file.



 Understood, but this change also applies to other loaded codecs, like
 .lzo, .bz, etc. Adding a new codec changes the default behavior for
 all InputFormats that don't override this method.


Yes it would. I think that forcing the developer of the file-based
InputFormat to implement this would be the best way to go.
Making this method abstract is the first thing that springs to mind.

This would break backwards compatibility, so I think we can only do that
in the 3.0.0 version.


  I consider silently producing garbage one of the worst kinds of problem
  to tackle.
  Because many custom file based input formats have stumbled (getting
  silently produced garbage) over the current isSplitable implementation
 I
  really want to avoid any more of this in the future.
  That is why I want to change the implementations in this area of Hadoop
 in
  such a way that this silently producing garbage effect is taken out.

 Adding validity assumptions to a common base class will affect a lot
 of users, most of whom are not InputFormat authors.


True, but the thing is that if a user uses an InputFormat written by someone
else and it silently produces garbage, they are affected in a much worse
way.


  So the question remains: What is the way this should be changed?
  I'm willing to build it and submit a patch.

 Would a logged warning suffice? This would aid debugging without an
 incompatible change in behavior. It could also be disabled easily. -C


Hmmm, people only look at logs when they have a problem. So I don't think
this would be enough.

Perhaps this makes sense:
- For 3.0: Shout at the developer who does it wrong (i.e. make it abstract
and force them to think about this), i.e. create a new abstract method
isSplittable (two t's) in FileInputFormat and remove isSplitable (one t).

To avoid needless code duplication (which we already have in the codebase)
create a helper method, something like 'fileNameIndicatesSplittableFile'
(returns enum: Splittable/NonSplittable/Unknown).

- For 2.x: Keep the end user safe: avoid silently producing garbage in all
situations where the developer already did it wrong (i.e. change
isSplitable to return false). This costs performance only in those
situations where the developer actually did it wrong (i.e. they didn't
think this through).

How about that?

P.S. I created an issue for the NLineInputFormat problem I found:
https://issues.apache.org/jira/browse/MAPREDUCE-5925

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Change proposal for FileInputFormat isSplitable

2014-06-11 Thread Niels Basjes
On Tue, Jun 10, 2014 at 8:10 PM, Chris Douglas cdoug...@apache.org wrote:

 On Fri, Jun 6, 2014 at 4:03 PM, Niels Basjes ni...@basjes.nl wrote:
  and if you then give the file the .gz extension this breaks all common
  sense / conventions about file names.



 That the suffix for all compression codecs in every context- and all
 future codecs- should determine whether a file can be split is not an
 assumption we can make safely. Again, that's not an assumption that
 held when people built their current systems, and they would be justly
 annoyed with the project for changing it.


That's not what I meant. What I understood from what was described is that
sometimes people use an existing file extension (like .gz) for a file that
is not a gzipped file.
Whether a file is splittable or not depends greatly on the actual codec
implementation that is used to read it. Using the default GzipCodec a .gz
file is not splittable, but that can be changed with a different
implementation, for example this one:
https://github.com/nielsbasjes/splittablegzip
So given a file extension, the file 'must' be in the format that is
described by that extension.

The flow is roughly as follows (see the sketch below):
- What is the file extension?
- Get the codec class registered to that extension.
- Is this a splittable codec? (Does the class implement the
SplittableCompressionCodec interface?)
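
A minimal sketch of that lookup, using the standard CompressionCodecFactory
(the file name used here is made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class CodecLookupDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    Path file = new Path(args.length > 0 ? args[0] : "input/data.gz");

    // 1) The file name extension is matched against the registered codecs.
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);

    // 2) No codec means the framework treats the file as uncompressed.
    // 3) Splittability is a property of the codec implementation, so swapping
    //    in a splittable gzip codec changes the answer without renaming files.
    boolean splittable = (codec == null)
        || (codec instanceof SplittableCompressionCodec);

    System.out.println(file + " -> codec="
        + (codec == null ? "none" : codec.getClass().getName())
        + ", splittable=" + splittable);
  }
}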

 I hold correct data much higher than performance and scalability; so the
  performance impact is a concern but it is much less important than the
 list
  of bugs we are facing right now.

 These are not bugs. NLineInputFormat doesn't support compressed input,
 and why would it? -C


I'm not saying it should (in fact, for this one I agree that it shouldn't).
The reality is that it accepts the file, decompresses it and then produces
output that 'looks good' but really is garbage.

I consider silently producing garbage one of the worst kinds of problems
to tackle.
Because many custom file-based input formats have stumbled over the current
isSplitable implementation (getting silently produced garbage), I really
want to avoid any more of this in the future.
That is why I want to change the implementations in this area of Hadoop in
such a way that this silently-producing-garbage effect is taken out.

So the question remains: what is the way this should be changed?
I'm willing to build it and submit a patch.




  The safest way would be either 2 or 4. Solution 3 would effectively be
 the
  same as the current implementation, yet it would catch the problem
  situations as long as people stick to normal file name conventions.
  Solution 3 would also allow removing some code duplication in several
  subclasses.
 
  I would go for solution 3.
 
  Niels Basjes




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Change proposal for FileInputFormat isSplitable

2014-06-06 Thread Niels Basjes
On Mon, Jun 2, 2014 at 1:21 AM, Chris Douglas cdoug...@apache.org wrote:

 On Sat, May 31, 2014 at 10:53 PM, Niels Basjes ni...@basjes.nl wrote:
  The Hadoop framework uses the filename extension  to automatically insert
  the right decompression codec in the read pipeline.

 This would be the new behavior, incompatible with existing code.


You are right, I was wrong. It is the LineRecordReader that inserts it.

Looking at this code and where it is used, I noticed that the bug I'm trying
to prevent is present in the current trunk.
The NLineInputFormat does not override isSplitable and uses the
LineRecordReader, which is capable of reading gzipped input. The overall
effect is that this InputFormat silently produces garbage (missing lines +
duplicated lines) when run against a gzipped file. I just verified this.

 So if someone does what you describe then they would need to unload all
 compression codecs or face decompression errors. And if it really was
 gzipped then it would not be splittable at all.

Assume an InputFormat configured for a job assumes that isSplitable
 returns true because it extends FileInputFormat. After the change, it
 could spuriously return false based on the suffix of the input files.
 In the prenominate example, SequenceFile is splittable, even if the
 codec used in each block is not. -C


and if you then give the file the .gz extension this breaks all common
sense / conventions about file names.


Let's reiterate the options I see now:
1) isSplitable -- return true
Too unsafe; I say this must change. I alone have hit my head twice so far on
this, many others have too, and even the current trunk still has this bug in
there.

2) isSplitable -- return false
Safe, but too slow in some cases. In those cases the actual
implementation can simply override it very easily and regain its original
performance.

3) isSplitable -- true (same as the current implementation) unless you use
a file extension that is associated with a non-splittable compression codec
(i.e. .gz or something like that).
If a custom format wants to break with well-known conventions about
filenames, it should simply override isSplitable itself.

4) isSplitable -- abstract
Compatibility breaker. I see this as the cleanest way to force the
developer of a custom FileInputFormat to think about their specific case.

I hold correct data much higher than performance and scalability; so the
performance impact is a concern but it is much less important than the list
of bugs we are facing right now.

The safest way would be either 2 or 4. Solution 3 would effectively be the
same as the current implementation, yet it would catch the problem
situations as long as people stick to normal file name conventions.
Solution 3 would also allow removing some code duplication in several
subclasses.

I would go for solution 3.

Niels Basjes


Re: Change proposal for FileInputFormat isSplitable

2014-05-31 Thread Niels Basjes
Ok, got it.

If someone has an Avro file (foo.avro) and gzips that (foo.avro.gz) then
the framework will select the GzipCodec, which is not capable of splitting,
and that will cause the problem. So by gzipping a splittable file it becomes
non-splittable.

At my workplace we have applied gzip compression to Avro, but then the
compression applies to the blocks inside the Avro file. So there are multiple
gzipped blocks inside an Avro container, which remains a splittable file
without any changes.
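
For illustration, this is roughly what writing such a block-compressed Avro
container looks like with the Avro Java API (the schema and file name below
are made up); the codec is applied to each block inside the container, not to
the file as a whole:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class BlockCompressedAvroDemo {
  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Line\","
            + "\"fields\":[{\"name\":\"text\",\"type\":\"string\"}]}");

    try (DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      // Deflate (the algorithm behind gzip) is applied per Avro block,
      // so the resulting .avro container stays splittable.
      writer.setCodec(CodecFactory.deflateCodec(6));
      writer.create(schema, new File("lines.avro"));

      GenericRecord record = new GenericData.Record(schema);
      record.put("text", "hello world");
      writer.append(record);
    }
  }
}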

How would someone create the situation you are referring to?
On May 31, 2014 1:06 AM, Doug Cutting cutt...@apache.org wrote:

 I was trying to explain my comment, where I stated that, changing the
 default implementation to return false would be an incompatible
 change.  The patch was added 6 months after that comment, so the
 comment didn't address the patch.

 The patch does not appear to change the default implementation to
 return false unless the suffix of the file name is that of a known
 unsplittable compression format.  So the folks who'd be harmed by this
 are those who used a suffix like .gz for an Avro, Parquet or
 other-format file.  Their applications might suddenly run much slower
 and it would be difficult for them to determine why.  Such folks are
 probably few, but perhaps exist.  I'd prefer a change that avoided
 that possibility entirely.

 Doug

 On Fri, May 30, 2014 at 3:02 PM, Niels Basjes ni...@basjes.nl wrote:
  Hi,
 
  The way I see the effects of the original patch on existing subclasses:
  - implemented isSplitable
    -- no performance difference.
  - did not implement isSplitable
    -- then there is no performance difference if the container is either
       not compressed or uses a splittable compression.
    -- If it uses a common non-splittable compression (like gzip) then the
       output will suddenly be different (which is the correct answer) and
       the jobs will finish sooner because the input is not processed
       multiple times.
 
  Where do you see a performance impact?
 
  Niels
  On May 30, 2014 8:06 PM, Doug Cutting cutt...@apache.org wrote:
 
  On Thu, May 29, 2014 at 2:47 AM, Niels Basjes ni...@basjes.nl wrote:
   For arguments I still do not fully understand this was rejected by
 Todd
  and
   Doug.
 
  Performance is a part of compatibility.
 
  Doug
 



Re: Change proposal for FileInputFormat isSplitable

2014-05-31 Thread Niels Basjes
The Hadoop framework uses the filename extension  to automatically insert
the right decompression codec in the read pipeline.
So if someone does what you describe then they would need to unload all
compression codecs or face decompression errors. And if it really was
gzipped then it would not be splittable at all.

Niels
On May 31, 2014 11:12 PM, Chris Douglas cdoug...@apache.org wrote:

 On Fri, May 30, 2014 at 11:05 PM, Niels Basjes ni...@basjes.nl wrote:
  How would someone create the situation you are referring to?

 By adopting a naming convention where the filename suffix doesn't
 imply that the raw data are compressed with that codec.

 For example, if a user named SequenceFiles foo.lzo and foo.gz to
 record which codec was used, then isSplittable would spuriously return
 false. -C

  On May 31, 2014 1:06 AM, Doug Cutting cutt...@apache.org wrote:
 
  I was trying to explain my comment, where I stated that, changing the
  default implementation to return false would be an incompatible
  change.  The patch was added 6 months after that comment, so the
  comment didn't address the patch.
 
  The patch does not appear to change the default implementation to
  return false unless the suffix of the file name is that of a known
  unsplittable compression format.  So the folks who'd be harmed by this
  are those who used a suffix like .gz for an Avro, Parquet or
  other-format file.  Their applications might suddenly run much slower
  and it would be difficult for them to determine why.  Such folks are
  probably few, but perhaps exist.  I'd prefer a change that avoided
  that possibility entirely.
 
  Doug
 
  On Fri, May 30, 2014 at 3:02 PM, Niels Basjes ni...@basjes.nl wrote:
   Hi,
  
   The way I see the effects of the original patch on existing subclasses:
   - implemented isSplitable
     -- no performance difference.
   - did not implement isSplitable
     -- then there is no performance difference if the container is either
        not compressed or uses a splittable compression.
     -- If it uses a common non-splittable compression (like gzip) then the
        output will suddenly be different (which is the correct answer) and
        the jobs will finish sooner because the input is not processed
        multiple times.
  
   Where do you see a performance impact?
  
   Niels
   On May 30, 2014 8:06 PM, Doug Cutting cutt...@apache.org wrote:
  
   On Thu, May 29, 2014 at 2:47 AM, Niels Basjes ni...@basjes.nl
 wrote:
For arguments I still do not fully understand this was rejected by
  Todd
   and
Doug.
  
   Performance is a part of compatibility.
  
   Doug
  
 



Re: Change proposal for FileInputFormat isSplitable

2014-05-29 Thread Niels Basjes
My original proposal (from about 3 years ago) was to change the isSplitable
method to return a safe default (you can see that in the patch that is
still attached to that Jira issue).
For arguments I still do not fully understand, this was rejected by Todd and
Doug.

That is why my new proposal is to deprecate (remove!) the old method
with the typo in Hadoop 3.0 and replace it with something correct and less
error-prone.
Given that this would happen in a major version jump, I thought that would
be the right time to do it.

Niels


On Thu, May 29, 2014 at 11:34 AM, Steve Loughran ste...@hortonworks.comwrote:

 On 28 May 2014 20:50, Niels Basjes ni...@basjes.nl wrote:

  Hi,
 
  Last week I ran into this problem again
  https://issues.apache.org/jira/browse/MAPREDUCE-2094
 
  What happens here is that the default implementation of the isSplitable
  method in FileInputFormat is so unsafe that just about everyone who
  implements a new subclass is likely to get this wrong. The effect of
  getting this wrong is that all unit tests succeed and running it against
  'large' input files (64MiB) that are compressed using a non-splittable
  compression (often Gzip) will cause the input to be fed into the mappers
  multiple time (i.e. you get garbage results without ever seeing any
  errors).
 
  Last few days I was at Berlin buzzwords talking to someone about this bug
 

 that was me, I recall.


  and this resulted in the following proposal which I would like your
  feedback on.
 
  1) This is a change that will break backwards compatibility (deliberate
  choice).
  2) The FileInputFormat will get 3 methods (the old isSplitable with the
  typo of one 't' in the name will disappear):
  (protected) isSplittableContainer -- true unless compressed with
  non-splittable compression.
  (protected) isSplittableContent -- abstract, MUST be implemented by
  the subclass
  (public)  isSplittable -- isSplittableContainer &&
  isSplittableContent
 
  The idea is that only the isSplittable is used by other classes to know
 if
  this is a splittable file.
  The effect I hope to get is that a developer writing their own
  fileinputformat (which I alone have done twice so far) is 'forced' and
  'helped' getting this right.
 

 I could see making the attributes more explicit would be good -but stopping
 everything that exists from working isn't going to fly.

 what about some subclass, AbstractSplittableFileInputFormat that implements
 the container properly, requires that content one -and then calculates
 IsSplitable() from the results? Existing code: no change, new formats can
 descend from this (and built in ones retrofitted).



  The reason for me to propose this as an incompatible change is that this
  way I hope to eradicate some of the existing bugs in custom
 implementations
  'out there'.
 
  P.S. If you agree to this change then I'm willing to put my back into it
  and submit a patch.
 
  --
  Best regards,
 
  Niels Basjes
 

 --




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Change proposal for FileInputFormat isSplitable

2014-05-29 Thread Niels Basjes
This is exactly why I'm proposing a change that will either 'fix silently'
(my original patch from 3 years ago) or 'break loudly' (my current
proposal) old implementations.
I'm convinced that there are at least 100 companies worldwide that have a
custom implementation with this bug and have no clue they have been basing
decisions upon silently corrupted data.


On Thu, May 29, 2014 at 1:21 PM, Jay Vyas jayunit...@gmail.com wrote:

 I think breaking backwards compat is sensible since it's easily caught by
 the compiler, and in this case the alternative is a
 runtime error that can result in terabytes of mucked-up output.

  On May 29, 2014, at 6:11 AM, Matt Fellows 
 matt.fell...@bespokesoftware.com wrote:
 
  As someone who doesn't really contribute, just lurks, I could well be
 misinformed or under-informed, but I don't see why we can't deprecate a
 method which could cause dangerous side effects?
  People can still use the deprecated methods for backwards compatibility,
 but are discouraged by compiler warnings, and any changes they write to
 their code can start to use the new functionality?
 
  *Apologies if I'm stepping into a Hadoop holy war here
 
 
  On Thu, May 29, 2014 at 10:47 AM, Niels Basjes ni...@basjes.nl wrote:
  My original proposal (from about 3 years ago) was to change the
 isSplitable
  method to return a safe default ( you can see that in the patch that is
  still attached to that Jira issue).
  For arguments I still do not fully understand this was rejected by Todd
 and
  Doug.
 
  So that is why my new proposal is to deprecate (remove!) the old method
  with the typo in Hadoop 3.0 and replace it with something correct and
 less
  error prone.
  Given the fact that this would happen in a major version jump I thought
  that would be the right time to do that.
 
  Niels
 
 
  On Thu, May 29, 2014 at 11:34 AM, Steve Loughran 
 ste...@hortonworks.comwrote:
 
   On 28 May 2014 20:50, Niels Basjes ni...@basjes.nl wrote:
  
Hi,
   
Last week I ran into this problem again
https://issues.apache.org/jira/browse/MAPREDUCE-2094
   
What happens here is that the default implementation of the
 isSplitable
method in FileInputFormat is so unsafe that just about everyone who
implements a new subclass is likely to get this wrong. The effect of
getting this wrong is that all unit tests succeed and running it
 against
'large' input files (64MiB) that are compressed using a
 non-splittable
compression (often Gzip) will cause the input to be fed into the
 mappers
multiple time (i.e. you get garbage results without ever seeing any
errors).
   
Last few days I was at Berlin buzzwords talking to someone about
 this bug
   
  
   that was me, I recall.
  
  
and this resulted in the following proposal which I would like your
feedback on.
   
1) This is a change that will break backwards compatibility
 (deliberate
choice).
2) The FileInputFormat will get 3 methods (the old isSplitable with
 the
typo of one 't' in the name will disappear):
(protected) isSplittableContainer -- true unless compressed
 with
non-splittable compression.
(protected) isSplittableContent -- abstract, MUST be
 implemented by
the subclass
    (public)  isSplittable -- isSplittableContainer &&
    isSplittableContent
   
The idea is that only the isSplittable is used by other classes to
 know
   if
this is a splittable file.
The effect I hope to get is that a developer writing their own
fileinputformat (which I alone have done twice so far) is 'forced'
 and
'helped' getting this right.
   
  
   I could see making the attributes more explicit would be good -but
 stopping
   everything that exists from working isn't going to fly.
  
   what about some subclass, AbstractSplittableFileInputFormat that
 implements
   the container properly, requires that content one -and then calculates
   IsSplitable() from the results? Existing code: no change, new formats
 can
   descend from this (and built in ones retrofitted).
  
  
  
The reason for me to propose this as an incompatible change is that
 this
way I hope to eradicate some of the existing bugs in custom
   implementations
'out there'.
   
P.S. If you agree to this change then I'm willing to put my back
 into it
and submit a patch.
   
--
Best regards,
   
Niels Basjes
   
  
   --
   CONFIDENTIALITY NOTICE
   NOTICE: This message is intended for the use of the individual or
 entity to
   which it is addressed and may contain information that is
 confidential,
   privileged and exempt from disclosure under applicable law. If the
 reader
   of this message is not the intended recipient, you are hereby
 notified that
   any printing, copying, dissemination, distribution, disclosure or
   forwarding of this communication is strictly prohibited. If you have
   received this communication in error, please contact the sender
 immediately

Re: Change proposal for FileInputFormat isSplitable

2014-05-29 Thread Niels Basjes
I forgot to ask a relevant question: what made the originally proposed
solution incompatible?
To me it still seems to be a clean, backward-compatible solution that fixes
this issue in a simple way.

Perhaps Todd can explain why?

Niels
On May 29, 2014 2:17 PM, Niels Basjes ni...@basjes.nl wrote:

 This is exactly why I'm proposing a change that will either 'fix silently'
 (my original patch from 3 years ago) or 'break loudly' (my current
 proposal) old implementations.
 I'm convinced that there are at least 100 companies worldwide that have a
 custom implementation with this bug and have no clue they have been basing
 decisions upon silently corrupted data.


 On Thu, May 29, 2014 at 1:21 PM, Jay Vyas jayunit...@gmail.com wrote:

 I think breaking backwards compat is sensible since It's easily caught by
 the compiler and  in this case where the alternative is a
 Runtime error that can result in terabytes of mucked up output.

  On May 29, 2014, at 6:11 AM, Matt Fellows 
 matt.fell...@bespokesoftware.com wrote:
 
  As someone who doesn't really contribute, just lurks, I could well be
 misinformed or under-informed, but I don't see why we can't deprecate a
 method which could cause dangerous side effects?
  People can still use the deprecated methods for backwards
 compatibility, but are discouraged by compiler warnings, and any changes
 they write to their code can start to use the new functionality?
 
  *Apologies if I'm stepping into a Hadoop holy war here
 
 
  On Thu, May 29, 2014 at 10:47 AM, Niels Basjes ni...@basjes.nl
 wrote:
  My original proposal (from about 3 years ago) was to change the
 isSplitable
  method to return a safe default ( you can see that in the patch that is
  still attached to that Jira issue).
  For arguments I still do not fully understand this was rejected by
 Todd and
  Doug.
 
  So that is why my new proposal is to deprecate (remove!) the old method
  with the typo in Hadoop 3.0 and replace it with something correct and
 less
  error prone.
  Given the fact that this would happen in a major version jump I thought
  that would be the right time to do that.
 
  Niels
 
 
  On Thu, May 29, 2014 at 11:34 AM, Steve Loughran 
 ste...@hortonworks.comwrote:
 
   On 28 May 2014 20:50, Niels Basjes ni...@basjes.nl wrote:
  
Hi,
   
Last week I ran into this problem again
https://issues.apache.org/jira/browse/MAPREDUCE-2094
   
What happens here is that the default implementation of the
 isSplitable
method in FileInputFormat is so unsafe that just about everyone who
implements a new subclass is likely to get this wrong. The effect
 of
getting this wrong is that all unit tests succeed and running it
 against
'large' input files (64MiB) that are compressed using a
 non-splittable
compression (often Gzip) will cause the input to be fed into the
 mappers
multiple time (i.e. you get garbage results without ever seeing any
errors).
   
Last few days I was at Berlin buzzwords talking to someone about
 this bug
   
  
   that was me, I recall.
  
  
and this resulted in the following proposal which I would like your
feedback on.
   
1) This is a change that will break backwards compatibility
 (deliberate
choice).
2) The FileInputFormat will get 3 methods (the old isSplitable
 with the
typo of one 't' in the name will disappear):
(protected) isSplittableContainer -- true unless compressed
 with
non-splittable compression.
(protected) isSplittableContent -- abstract, MUST be
 implemented by
the subclass
    (public)  isSplittable -- isSplittableContainer &&
    isSplittableContent
   
The idea is that only the isSplittable is used by other classes to
 know
   if
this is a splittable file.
The effect I hope to get is that a developer writing their own
fileinputformat (which I alone have done twice so far) is 'forced'
 and
'helped' getting this right.
   
  
   I could see making the attributes more explicit would be good -but
 stopping
   everything that exists from working isn't going to fly.
  
   what about some subclass, AbstractSplittableFileInputFormat that
 implements
   the container properly, requires that content one -and then
 calculates
   IsSplitable() from the results? Existing code: no change, new
 formats can
   descend from this (and built in ones retrofitted).
  
  
  
The reason for me to propose this as an incompatible change is
 that this
way I hope to eradicate some of the existing bugs in custom
   implementations
'out there'.
   
P.S. If you agree to this change then I'm willing to put my back
 into it
and submit a patch.
   
--
Best regards,
   
Niels Basjes
   
  

Change proposal for FileInputFormat isSplitable

2014-05-28 Thread Niels Basjes
Hi,

Last week I ran into this problem again
https://issues.apache.org/jira/browse/MAPREDUCE-2094

What happens here is that the default implementation of the isSplitable
method in FileInputFormat is so unsafe that just about everyone who
implements a new subclass is likely to get this wrong. The effect of
getting this wrong is that all unit tests succeed and running it against
'large' input files (64MiB) that are compressed using a non-splittable
compression (often Gzip) will cause the input to be fed into the mappers
multiple times (i.e. you get garbage results without ever seeing any
errors).

Last few days I was at Berlin buzzwords talking to someone about this bug
and this resulted in the following proposal which I would like your
feedback on.

1) This is a change that will break backwards compatibility (deliberate
choice).
2) The FileInputFormat will get 3 methods (the old isSplitable with the
typo of one 't' in the name will disappear):
(protected) isSplittableContainer --> true unless compressed with
non-splittable compression.
(protected) isSplittableContent --> abstract, MUST be implemented by
the subclass
(public)    isSplittable --> isSplittableContainer &&
isSplittableContent

The idea is that only the isSplittable is used by other classes to know if
this is a splittable file.
The effect I hope to get is that a developer writing their own
FileInputFormat (which I alone have done twice so far) is 'forced' and
'helped' to get this right.

The reason for me to propose this as an incompatible change is that this
way I hope to eradicate some of the existing bugs in custom implementations
'out there'.
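
To make the proposal concrete, here is a minimal sketch of how the three
methods could look. This is illustration only, not the actual patch; the
class name is made up and the container check shown here simply asks the
codec factory.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch of the proposed API; "SafeFileInputFormat" is a made-up name.
public abstract class SafeFileInputFormat<K, V> extends FileInputFormat<K, V> {

  // True unless the file is wrapped in a non-splittable compression format.
  protected boolean isSplittableContainer(JobContext context, Path file) {
    CompressionCodec codec =
        new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null || codec instanceof SplittableCompressionCodec;
  }

  // Subclasses MUST say whether their record format itself can be split.
  protected abstract boolean isSplittableContent(JobContext context, Path file);

  // The only method the rest of the framework would consult.
  public final boolean isSplittable(JobContext context, Path file) {
    return isSplittableContainer(context, file)
        && isSplittableContent(context, file);
  }
}

A subclass then only has to answer the content question, and forgetting to
do so becomes a compile error instead of silently duplicated input.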

P.S. If you agree to this change then I'm willing to put my back into it
and submit a patch.

-- 
Best regards,

Niels Basjes


Re: Sorting user defined MR counters.

2013-01-07 Thread Niels Basjes
Hi Steve,

 Now for submitting changes for Hadoop: Is it desirable that I fix these in
  my change set or should I leave these as-is to avoid obfuscating the
  changes that are relevant to the Jira at hand?
 

 I recommend a cleanup first - that's likely to go in without any argument.
 Your patch with the new features would be a diff against the cleaned-up
 code, so there are fewer changes to be reviewed.



Ok, I'll have a look what I can do.
Should I focus on fixing problems within the entire code base or restrict my
changes to a smaller set of subprojects (i.e. only the mapreduce ones)?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Making Gzip splittable

2012-02-22 Thread Niels Basjes
Hi,

On Wed, Feb 22, 2012 at 19:14, Tim Broberg tim.brob...@exar.com wrote:

 There are three options here:
  1 - Add your codec, and alternative to the default gzip codec.
  2 - Modify the gzip codec to incorporate your feature so that it is
 pseudo-splittable by default (skippable?)
  3 - Do nothing

 The code uses the normal splittability interface and doesn't invent some
 new solution. It seems perfectly well explained.


The choice was made to implement it as a separate 'Codec' that reuses all
decompression functionality from the existing GzipCodec without making any
changes to the original. This way there is no duplicate code and there is
no risk the existing GzipCodec is affected by the new functionality.
This was actually one of the first review comments I got on one of the
first versions (which did have a few minor changes in the GzipCodec).
So that is why option '1' was chosen instead of '2'.

There is a lot of explanation in there on how to switch over from one codec
 to the other.


Enabling the codec is only one setting.
There is, however, quite a bit of documentation on how to use it.
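
As an illustration, enabling such a codec typically comes down to a single
configuration property. A sketch, assuming the codec class is called
SplittableGzipCodec (the real class name and package come from the patch):

import org.apache.hadoop.conf.Configuration;

// Sketch only: register the codec so the CompressionCodecFactory can
// find it (assuming the codec claims the .gz extension as its default).
Configuration conf = new Configuration();
conf.set("io.compression.codecs",
    "org.apache.hadoop.io.compress.DefaultCodec,"
        + "org.apache.hadoop.io.compress.SplittableGzipCodec,"
        + "org.apache.hadoop.io.compress.BZip2Codec");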

Does it all get simpler if skippability is implemented by default but the
 option is not enabled?


There are two answers to this:
1) No, it won't get simpler.
2) This feature cannot be disabled per codec. The reason is that the
framework creates splits if the applicable codec implements the
SplittableCompressionCodec interface. This check is done purely with an
instanceof check. After that the FileInputFormat creates the splits
without consulting the codec class at all. So a codec is either splittable
or not, and the splits are defined independently of the codec.
So there is (unfortunately) currently no way to create a codec that can be
splittable/non-splittable by using a config setting.
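
For reference, the check looks roughly like this (paraphrased from the
isSplitable override in TextInputFormat, not a verbatim copy; note the
single 't' in the method name is the real one in FileInputFormat):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;

// Inside a FileInputFormat subclass: a file may be split when it is either
// uncompressed or compressed with a codec that implements
// SplittableCompressionCodec. Nothing else is asked of the codec.
protected boolean isSplitable(JobContext context, Path file) {
  CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
  if (codec == null) {
    return true;
  }
  return codec instanceof SplittableCompressionCodec;
}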

Does this make things any less potentially confusing?


I don't think this would make it less confusing.

Niels




- Tim.

 
 From: ni...@basj.es [ni...@basj.es] On Behalf Of Niels Basjes [
 ni...@basjes.nl]
 Sent: Sunday, February 19, 2012 4:23 PM
 To: common-dev
 Subject: Making Gzip splittable

 Hi,

 As some of you know I've created a patch that effectively makes Gzip
 splittable.

 https://issues.apache.org/jira/browse/HADOOP-7076

 What this does is: for a split somewhere in the middle of the file, it will
 read from the start of the file up to the point where the split starts.
 This is a useful waste of resources because it creates room to run
 heavy-lifting mappers in parallel.
 Due to this balance between the waste being useful and the waste being
 wasteful I've included extensive documentation in the patch on how it works
 and how to use it.

 I've seen that there are quite a few real life situations where I expect my
 solution can be useful.

 What I created is, as far as I can tell, the only way you can split a
 gzipped file without prior knowledge about the actual file.
 If you do have prior information then other directions with a similar goal
 are possible:
 - Analyzing the file beforehand:
   HADOOP-6153: https://issues.apache.org/jira/browse/HADOOP-6153
 - Create a specially crafted gzipped file:
   HADOOP-7909: https://issues.apache.org/jira/browse/HADOOP-7909

 Over the last year I've had review comments from Chris Douglas (until he
 stopped being involved in Hadoop) and later from Luke Lu.

 Now the last feedback I got from Luke is this:

  Niels, I'm ambivalent about this patch. It has clean code and
  documentation, OTOH, it has really confusing usage/semantics and
  dubious general utility that the community might not want to maintain
  as part of an official release. After having to explain many finer
  points of Hadoop to new users/developers these days, I think the
  downside of this patch might outweigh its benefits. I'm -0 on it.
  i.e., you need somebody else to +1 on this.

 So after consulting Eli I'm asking this group.

 My views on this feature:
 - I think this feature should go in because I think others can benefit from
 it.
 - I also think that it should remain disabled by default. It can then be
 used by those that read the documentation.
 - The implementation does not contain any decompression code at all. It
 only does the splitting smartness (it could even be refactored to make any
 codec splittable). It has been tested with both the Java and the native
 decompressors.

 What do you think?

 Is this a feature that should go in the official release or not?

 --
 Best regards

 Niels Basjes


Re: Gzip progress during map phase.

2011-12-27 Thread Niels Basjes
Yes, this is what I was looking for.
Thanks

-- 
Met vriendelijke groet,
Niels Basjes
(Verstuurd vanaf mobiel )
On 27 Dec. 2011 12:08, Koji Noguchi knogu...@yahoo-inc.com wrote:

 Assuming you're using TextInputFormat, it sounds like
 https://issues.apache.org/jira/browse/MAPREDUCE-773
 In 0.21.  Don't know about CDH.

 Koji


 On 12/27/11 2:00 AM, Niels Basjes ni...@basjes.nl wrote:

  I would not expect this. I would expect behaviour that is independent of
  the way the splits are created.
 
  --
  Met vriendelijke groet,
  Niels Basjes
  (Verstuurd vanaf mobiel )
  On 26 Dec. 2011 07:57, Anthony Urso antho...@cs.ucla.edu wrote:
 
  Gzip files (unlike uncompressed files) are not splittable, which may be
  causing the behavior that you described.
  On Dec 24, 2011 6:24 AM, Niels Basjes ni...@basjes.nl wrote:
 
  Hi,
 
  I noticed that the mapper progress indication in the hadoop cdh3
  distribution jumps from 0% to 100% for each gzipped input file. So when
  running with big gzipped input files the job appears to be stuck.
 
  I was unable to find a jira issue that describes this effect.
  Before I dive into this I have a few questions to you guys:
  1) is this a known effect for the 0.20 version? If so what is the jira
  issue?
  2) is this specific to gzip?
  3) is this effect still present in the MRv2/yarn version of Hadoop?
 
  Thanks.
  --
  Met vriendelijke groet,
  Niels Basjes
  (Verstuurd vanaf mobiel )
 
 




Re: Gzip progress during map phase.

2011-12-27 Thread Niels Basjes
I would not expect this. I would expect behaviour that is independent of
the way the splits are created.

-- 
Met vriendelijke groet,
Niels Basjes
(Verstuurd vanaf mobiel )
On 26 Dec. 2011 07:57, Anthony Urso antho...@cs.ucla.edu wrote:

 Gzip files (unlike uncompressed files) are not splittable, which may be
 causing the behavior that you described.
 On Dec 24, 2011 6:24 AM, Niels Basjes ni...@basjes.nl wrote:

  Hi,
 
  I noticed that the mapper progress indication in the hadoop cdh3
  distribution jumps from 0% to 100% for each gzipped input file. So when
  running with big gzipped input files the job appears to be stuck.
 
  I was unable to find a jira issue that describes this effect.
  Before I dive into this I have a few questions to you guys:
  1) is this a known effect for the 0.20 version? If so what is the jira
  issue?
  2) is this specific to gzip?
  3) is this effect still present in the MRv2/yarn version of Hadoop?
 
  Thanks.
  --
  Met vriendelijke groet,
  Niels Basjes
  (Verstuurd vanaf mobiel )
 



Re: Which branch for my patch?

2011-12-06 Thread Niels Basjes
I got this:

Hadoop QA commented on HADOOP-7076:
-1 overall.  Here are the results of testing the latest attachment

http://issues.apache.org/jira/secure/attachment/12506182/HADOOP-7076-branch-0.22.patch
  against trunk revision .
...
-1 patch.  The patch command could not apply the patch.

Did I do something wrong in the patch I created for branch-0.22?
Or is HADOOP-7435 not yet operational?

Thanks.

Niels Basjes


On Tue, Dec 6, 2011 at 00:17, Niels Basjes ni...@basjes.nl wrote:

 Hi,

 On Mon, Dec 5, 2011 at 18:54, Eli Collins e...@cloudera.com wrote:

  https://issues.apache.org/jira/browse/HADOOP-7076



   Your patch based on the old structure would be useful for
 backporting
   this feature from trunk to a release with the old structure (eg
 1.x,
   0.22). To request inclusion in a 1.x release set the target
 version to
   1.1.0 (and generate a patch against branch-1). To request
 inclusion in
   0.22 set target version to 0.22.0 (and generate a patch against
   branch-0.22).


 Turns out my changes are not trivial to backport to branch-1 because the
 SplittableCompressionCodec interface isn't there yet.
 So I'm limiting to trunk (0.23.1) and branch-0.22 (0.22.0).

 ...


 Unfortunately jenkins doesn't currently run tests against non-trunk
 trees.  For these branches you need to run test-patch (covered in the
 above page) and the tests yourself.

 Unfortunately the test-patch script mentioned on the site
 (dev-support/test-patch.sh /path/to/my.patch) doesn't exist in the
 branch-0.22. And I couldn't get the test-patch.sh that does exist to work.

 So I did against a clean checkout:
    patch -p0 < HADOOP-7076-branch-0.22.patch
 followed by
   ant clean test jar

 Which succeeded.


 --
 Best regards / Met vriendelijke groeten,

 Niels Basjes




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Which branch for my patch?

2011-12-05 Thread Niels Basjes
Hi,

 https://issues.apache.org/jira/browse/HADOOP-7076
 
  Jenkins has just completed.
  Although it passed everything else it was '-1' because of 9 javadoc
  warnings that do not seem related to my patch.
 
 
 https://builds.apache.org/job/PreCommit-HADOOP-Build/432/artifact/trunk/hadoop-common-project/patchprocess/patchJavadocWarnings.txt

 Yea, these are not due to your patch. I'll bump the javadoc warnings
 in test-patch.properties.


Thanks.



  Your patch based on the old structure would be useful for backporting
  this feature from trunk to a release with the old structure (eg 1.x,
  0.22). To request inclusion in a 1.x release set the target version to
  1.1.0 (and generate a patch against branch-1). To request inclusion in
  0.22 set target version to 0.22.0 (and generate a patch against
  branch-0.22).

  Do I simply make separate Jira (related) issues for these backports?

 Nope, just set the target version field to the appropriate version
 and uploaded a patch, eg hadoop-7076-branch-1.patch.


So that I understand correctly:
- I add the targets to the Jira issue for each branch specific patch.
- I create a new patch file for each version I want the feature to appear
in and attach these to the issue.
- I name these patches something like issue id-date-branchid.patch so
that the committer can clearly see what it was intended for.

Do I have to do something to ensure Jenkins will accept this all correctly?
Perhaps in naming convention? Or in the timing between uploading the
various patches?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Which branch for my patch?

2011-12-04 Thread Niels Basjes
Hi,

On Wed, Nov 30, 2011 at 18:51, Eli Collins e...@cloudera.com wrote:

 Thanks for contributing.  The best place to contribute new features is
 to trunk. It's currently an easy merge from trunk to branch 23 to get
 it in a 23.x release (you can set the jira's target version to 23.1 to
 indicate this).


I've just uploaded the new patch created against the trunk and set the
target for 0.23.1 as you indicated.

https://issues.apache.org/jira/browse/HADOOP-7076

Jenkins has just completed.
Although it passed everything else it was '-1' because of 9 javadoc
warnings that do not seem related to my patch.

https://builds.apache.org/job/PreCommit-HADOOP-Build/432/artifact/trunk/hadoop-common-project/patchprocess/patchJavadocWarnings.txt

Your patch based on the old structure would be useful for backporting
 this feature from trunk to a release with the old structure (eg 1.x,
 0.22). To request inclusion in a 1.x release set the target version to
 1.1.0 (and generate a patch against branch-1). To request inclusion in
 0.22 set target version to 0.22.0 (and generate a patch against
 branch-0.22).


Do I simply make separate Jira (related) issues for these backports?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Which branch for my patch?

2011-12-01 Thread Niels Basjes
Thanks,

I'll get busy creating a new patch over the next few days.

Niels Basjes

On Wed, Nov 30, 2011 at 18:51, Eli Collins e...@cloudera.com wrote:

 Hey Niels,

 Thanks for contributing.  The best place to contribute new features is
 to trunk. It's currently an easy merge from trunk to branch 23 to get
 it in a 23.x release (you can set the jira's target version to 23.1 to
 indicate this).

 Your patch based on the old structure would be useful for backporting
 this feature from trunk to a release with the old structure (eg 1.x,
 0.22). To request inclusion in a 1.x release set the target version to
 1.1.0 (and generate a patch against branch-1). To request inclusion in
 0.22 set target version to 0.22.0 (and generate a patch against
 branch-0.22).

 Thanks,
 Eli

 On Wed, Nov 30, 2011 at 8:23 AM, Niels Basjes ni...@basjes.nl wrote:
  Hi all,
 
  A while ago I created a feature for Hadoop and submitted this to be
  included (HADOOP-7076) .
  Around the same time the MRv2 started happening and the entire source
 tree
  was restructured.
 
  At this moment I'm prepared to change the patch I created earlier so I
 can
  submit it again for your consideration.
 
  Prompted by the email about the new branches (branch-1 and branch-1.0) I'm
  a bit puzzled at this moment where to start.
 
  I see the mentioned branches and the trunk as probable starting points.
 
  As far as I understand the repository structure the branch-1 is the basis
  for the old style Hadoop and the trunk is the basis for the yarn
 Hadoop.
 
  For which branch of the source tree should I make my changes so you guys
  will reevaluate it for inclusion?
 
  Thanks.
 
  --
  Best regards / Met vriendelijke groeten,
 
  Niels Basjes




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Which branch for my patch?

2011-11-30 Thread Niels Basjes
Hi all,

A while ago I created a feature for Hadoop and submitted this to be
included (HADOOP-7076) .
Around the same time the MRv2 started happening and the entire source tree
was restructured.

At this moment I'm prepared to change the patch I created earlier so I can
submit it again for your consideration.

Prompted by the email about the new branches (branch-1 and branch-1.0) I'm a
bit puzzled at this moment where to start.

I see the mentioned branches and the trunk as probable starting points.

As far as I understand the repository structure the branch-1 is the basis
for the old style Hadoop and the trunk is the basis for the yarn Hadoop.

For which branch of the source tree should I make my changes so you guys
will reevaluate it for inclusion?

Thanks.

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: making file system block size bigger to improve hdfs performance ?

2011-10-03 Thread Niels Basjes
Have you tried it to see what difference it makes?

-- 
Met vriendelijke groet,
Niels Basjes
(Verstuurd vanaf mobiel )
On 3 Oct. 2011 07:06, Jinsong Hu jinsong...@hotmail.com wrote:
 Hi there,
 I just thought of an idea. When we format the disk, the block size is
 usually 1K to 4K. For hdfs, the block size is usually 64M.
 I wonder: if we change the raw file system's block size to something
 significantly bigger, say, 1M or 8M, will that improve
 disk IO performance for hadoop's hdfs?
 Currently, I notice that the mapr distribution uses mfs, its own file system.

 That resulted in a 4x performance gain in terms
 of disk IO. I just wonder if, by tuning the hosting OS parameters, we can
 achieve better disk IO performance with just the regular
 apache hadoop distribution.
 I understand that making the block size bigger can result in some disk
 space waste for small files. However, for disks dedicated
 to hdfs, where most of the files are very big, I just wonder if it is a
 good idea. Anybody have any comments?

 Jimmy



What happened to Chris?

2011-07-31 Thread Niels Basjes
Hi all,

Over the last 7 months I've been periodically emailing with Chris Douglas
(via his @apache.org account) about a new feature for Hadoop I've created
(HADOOP-7076).
However I've not had any response at all to my last two emails (dating back
about 6 weeks and about 1 week).

So I'm wondering why this is happening. Is he on a long vacation? Or is
there something else?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Hadoop Annotations support

2011-07-02 Thread Niels Basjes
I assume you mean something like this:

https://github.com/SpringSource/spring-hadoop

On Sat, Jul 2, 2011 at 04:22, Raja Nagendra Kumar
nagendra.r...@tejasoft.com wrote:

 Hi,

 Are there any plans for Hadoop to support annotations, especially for
 eliminating API-level configuration? E.g.

 conf.setMapperClass(MaxTemperatureMapper.class);
 conf.setCombinerClass(MaxTemperatureReducer.class);
 conf.setReducerClass(MaxTemperatureReducer.class);

 can be easily eliminated through proper class-level annotations.


 Regards,
 Raja Nagendra Kumar,
 C.T.O
 www.tejasoft.com






-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: help me to solve Exception

2011-06-14 Thread Niels Basjes
 11/06/04 01:47:09 WARN hdfs.DFSClient: DataStreamer Exception: 
 org.apache.hadoop.ipc.RemoteException: java.io.IOException: File 
 /user/eng-zinab/inn/In (copy) could only be replicated to 0 nodes, instead of 
 1

Do you have a datanode running?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


[jira] [Reopened] (HADOOP-7305) Eclipse project files are incomplete

2011-06-04 Thread Niels Basjes (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Basjes reopened HADOOP-7305:
--


Apparently there are some issues with the first version in combination with OS 
X. 

 Eclipse project files are incomplete
 

 Key: HADOOP-7305
 URL: https://issues.apache.org/jira/browse/HADOOP-7305
 Project: Hadoop Common
  Issue Type: Improvement
  Components: build
Reporter: Niels Basjes
Assignee: Niels Basjes
Priority: Minor
 Fix For: 0.22.0

 Attachments: HADOOP-7305-2011-05-19.patch, 
 HADOOP-7305-2011-05-30.patch


 After a fresh checkout of hadoop-common I do 'ant compile eclipse'.
 I open eclipse, set ANT_HOME and build the project. 
 At that point the following error appears:
 {quote}
 The type com.sun.javadoc.RootDoc cannot be resolved. It is indirectly 
 referenced from required .class files   
 ExcludePrivateAnnotationsJDiffDoclet.java   
 /common/src/java/org/apache/hadoop/classification/tools line 1  Java Problem
 {quote}
 The solution is to add the tools.jar from the JDK to the 
 buildpath/classpath.
 This should be fixed in the build.xml.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Eclipse target

2011-05-19 Thread Niels Basjes
Hi Todd,

2011/5/19 Todd Lipcon t...@cloudera.com:
 Yes, I have to do this same thing manually every time I re-run ant eclipse.

 So, it seems like we should add it to the eclipse target, like you
 said. Feel free to file a JIRA and patch!

https://issues.apache.org/jira/browse/HADOOP-7305
(Hadoop QA should start in a few minutes to validate this).

-- 
Met vriendelijke groeten,

Niels Basjes


Re: MapReduce compilation error

2011-05-18 Thread Niels Basjes
Today I ran into the same error and I was puzzled by the content of this file.
What is the purpose of a test file that appears to have a deliberate
error and no code whatsoever?


2011/3/19 Harsh J qwertyman...@gmail.com:
 This shouldn't really interfere with your development. You may try to
 exclude it from Eclipse's build, perhaps.

 On Sat, Mar 19, 2011 at 1:39 AM, bikash sharma sharmabiks...@gmail.com 
 wrote:
 Hi,
 When I am compiling MapReduce source code after checking-in Eclipse, I am
 getting the following error:

 The declared package "" does not match the expected package "testjar"
 ClassWithNoPackage.java Hadoop-MR/src/test/mapred/testjar

 Any thoughts?

 Thanks,
 Bikash




 --
 Harsh J
 http://harshj.com




-- 
Met vriendelijke groeten,

Niels Basjes


Eclipse target

2011-05-18 Thread Niels Basjes
When I check out common, run 'ant eclipse' and then open Eclipse I get this error:

The type com.sun.javadoc.RootDoc cannot be resolved. It is indirectly
referenced from required .class
files   ExcludePrivateAnnotationsJDiffDoclet.java   
/common/src/java/org/apache/hadoop/classification/tools line
1   Java Problem

The problem is that the tools.jar is missing during build.

Now I'm an extreme novice when it comes to the ant build.xml, but I think
I solved the problem by ensuring the tools.jar is added to the eclipse
project files (i.e. the .classpath file).
My question to you all: is this the correct solution? Is it worth
submitting as a patch?

diff --git build.xml build.xml
index 26ccfa0..168b34f 100644
--- build.xml
+++ build.xml
@@ -1571,6 +1571,7 @@
 <library pathref="ivy-test.classpath" exported="false" />
 <variable path="ANT_HOME/lib/ant.jar" exported="false" />
 <library path="${conf.dir}" exported="false" />
+<library path="${java.home}/../lib/tools.jar" exported="false" />
   </classpath>
 </eclipse>
   </target>




-- 
Met vriendelijke groeten,

Niels Basjes


Report as a bug?

2011-01-29 Thread Niels Basjes
I was playing around with PMD, just to see what kind of messages it
gives on my hadoop feature.
I noticed a message about Dead code in org.apache.hadoop.fs.ftp.FTPFileSystem

Starting at about line 80:

String userAndPassword = uri.getUserInfo();
if (userAndPassword == null) {
  userAndPassword = (conf.get("fs.ftp.user." + host, null) + ":" + conf
      .get("fs.ftp.password." + host, null));
  if (userAndPassword == null) {
    throw new IOException("Invalid user/passsword specified");
  }
}


The last if block is the dead code as the string will always contain
at least the text ":" or "null:null"
It will probably fail a bit later when really trying to login with a
wrong uid/password.
So, is this worth reporting as a bug?

-- 
Met vriendelijke groeten,

Niels Basjes


[jira] Created: (HADOOP-7127) Bug in login error handling in org.apache.hadoop.fs.ftp.FTPFileSystem

2011-01-29 Thread Niels Basjes (JIRA)
Bug in login error handling in org.apache.hadoop.fs.ftp.FTPFileSystem
-

 Key: HADOOP-7127
 URL: https://issues.apache.org/jira/browse/HADOOP-7127
 Project: Hadoop Common
  Issue Type: Bug
  Components: fs
Reporter: Niels Basjes


I was playing around with PMD, just to see what kind of messages it gives on 
hadoop.
I noticed a message about Dead code in org.apache.hadoop.fs.ftp.FTPFileSystem

Starting at about line 80:

    String userAndPassword = uri.getUserInfo();
    if (userAndPassword == null) {
      userAndPassword = (conf.get("fs.ftp.user." + host, null) + ":" + conf
          .get("fs.ftp.password." + host, null));
      if (userAndPassword == null) {
        throw new IOException("Invalid user/passsword specified");
      }
    }

The last if block is the dead code as the string will always contain at least
the text ":" or "null:null"

This means that the error handling fails to work as intended.
It will probably fail a bit later when really trying to login with a wrong 
uid/password.

P.S. Fix the silly typo passsword in the exception message too.
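
For illustration, one possible null-safe rewrite (a sketch only, reusing the
variable names from the snippet above; not necessarily the fix that should
be committed):

    // Sketch: uri, conf and host come from the surrounding FTPFileSystem code.
    String userAndPassword = uri.getUserInfo();
    if (userAndPassword == null) {
      String user = conf.get("fs.ftp.user." + host, null);
      String password = conf.get("fs.ftp.password." + host, null);
      if (user == null || password == null) {
        throw new IOException("Invalid user/password specified");
      }
      userAndPassword = user + ":" + password;
    }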

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [Hadoop Wiki] Update of FrontPage by prosch

2011-01-10 Thread Niels Basjes
Seems like a good moment to blacklist the prosch user from ever
changing the wiki again.

2011/1/10 Apache Wiki wikidi...@apache.org:
 Dear Wiki user,

 You have subscribed to a wiki page or wiki category on Hadoop Wiki for 
 change notification.

 The FrontPage page has been changed by prosch.
 http://wiki.apache.org/hadoop/FrontPage?action=diff&rev1=175&rev2=176

 --

  == General Information ==
   * [[http://hadoop.apache.org/|Official Apache Hadoop Website]]: download, 
 bug-tracking, mailing-lists, etc.
   * [[ProjectDescription|Overview]] of Apache Hadoop
 -  * [[FAQ]]
 +  * [[FAQ]] 
 [[http://www.profi-fachuebersetzung.de/language-translation.html|Translation 
 agency]] / [[http://www.profischnell.com|Übersetzung Polnisch Deutsch]]
   * [[HadoopIsNot|What Hadoop is not]]
   * [[Distributions and Commercial Support|Distributions and Commercial 
 Support]] for Hadoop (RPMs, Debs, AMIs, etc)
   * [[HadoopPresentations|Presentations]], [[Books|books]], 
 [[HadoopArticles|articles]] and [[Papers|papers]] about Hadoop
   * PoweredBy, a list of sites and applications powered by Apache Hadoop
   * Support
    * [[Help|Getting help from the hadoop community]].
 -   * [[Support|People and companies for hire]].
 +   * [[Support|People and companies for hire]].
   * [[Conferences|Hadoop Community Events and Conferences]]
    * HadoopUserGroups (HUGs)
    * HadoopSummit




-- 
Met vriendelijke groeten,

Niels Basjes


Ready for review?

2011-01-07 Thread Niels Basjes
Hi,

I consider the patch I created to be ready for review by a code reviewer.

It does what I want it to do and Hudson gives an overall +1.
The http://wiki.apache.org/hadoop/HowToContribute was unclear to me on
what to do next.

So, what should I do?
Simply wait / change the state of the issue / ...

https://issues.apache.org/jira/browse/HADOOP-7076

-- 
Met vriendelijke groeten,

Niels Basjes


Re: Ready for review?

2011-01-07 Thread Niels Basjes
Hi Jakob,

Thanks for clarifying this to me.

Niels

2011/1/7 Jakob Homan jgho...@gmail.com:
 Niels-
   Thanks for the contribution.  For the moment your task is done.
 Now it's up to a committer to review the patch and, either provide you
 with feedback for its improvement, or commit it.  It's in the patch
 available state, which is the flag for reviewers to know there's work
 for them to do.  Since this is a volunteer effort, I'm afraid there's
 no firm timeline for when this will get done.
 -Jakob


 On Fri, Jan 7, 2011 at 6:50 AM, Niels Basjes ni...@basjes.nl wrote:
 Hi,

 I consider the patch I created to be ready for review by a code reviewer.

 It does what I want it to do and Hudson gives an overall +1.
 The http://wiki.apache.org/hadoop/HowToContribute was unclear to me on
 what to do next.

 So, what should I do?
 Simply wait / change the state of the issue / ...

 https://issues.apache.org/jira/browse/HADOOP-7076

 --
 Met vriendelijke groeten,

 Niels Basjes





-- 
Met vriendelijke groeten,

Niels Basjes


Jira workflow problem.

2011-01-06 Thread Niels Basjes
I seem to have selected the wrong option in Jira to get the latest
patch handled.
For some reason the option to indicate a new patch has been made
available is no longer present.

https://issues.apache.org/jira/browse/HADOOP-7076

What did I do wrong and what can I do to fix this?

Thanks.
-- 
Met vriendelijke groeten,

Niels Basjes


Re: Jira workflow problem.

2011-01-06 Thread Niels Basjes
Thanks guys,

I really appreciate your help.

Niels Basjes

2011/1/6 Doug Cutting cutt...@apache.org:
 The problem was that the submitter could transition from Patch Available
 to In Progress, following the Resume Progress transition, but the
 submitter cannot then transition anywhere from In Progress, only the
 assignee could.  I fixed that, so that the assignee can no longer follow
 that transition to a cul de sac.  I also made you a contributor so you could
 be assigned issues, and assigned you this issue.

 Doug

 On 01/06/2011 01:41 PM, Niels Basjes wrote:

 I seem to have selected the wrong option in Jira to get the latest
 patch handled.
 For some reason the option to indicate a new patch has been made
 available is no longer present.

 https://issues.apache.org/jira/browse/HADOOP-7076

 What did I do wrong and what can I do to fix this?

 Thanks.




-- 
Met vriendelijke groeten,

Niels Basjes


Build failed, hudson broken?

2011-01-05 Thread Niels Basjes
Hi,

I just submitted a patch for the feature I've been working on.
https://issues.apache.org/jira/browse/HADOOP-7076

This patch works fine on my system and passes all the unit tests.

Now some 30 minutes later it seems the build on the hudson has failed.
https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/155/

I'm not sure but to me it seems that there are issues with the hudson.
None of the errors in the log are related to my fixes and not only my
build (155)
but also builds 154 and 153 have failed with errors that are (at first
glance) the same.

Someone here knows how/where to get this problem fixed?

-- 
Met vriendelijke groeten,

Niels Basjes


Re: Build failed, hudson broken?

2011-01-05 Thread Niels Basjes
I found where to report this ... so I did:
https://issues.apache.org/jira/browse/INFRA-3340

2011/1/5 Niels Basjes ni...@basjes.nl:
 Hi,

 I just submitted a patch for the feature I've been working on.
 https://issues.apache.org/jira/browse/HADOOP-7076

 This patch works fine on my system and passes all the unit tests.

 Now some 30 minutes later it seems the build on the hudson has failed.
 https://hudson.apache.org/hudson/job/PreCommit-HADOOP-Build/155/

 I'm not sure but to me it seems that there are issues with the hudson.
 None of the errors in the log are related to my fixes and not only my
 build (155)
 but also builds 154 and 153 have failed with errors that are (at first
 glance) the same.

 Someone here knows how/where to get this problem fixed?

 --
 Met vriendelijke groeten,

 Niels Basjes




-- 
Met vriendelijke groeten,

Niels Basjes


What is the correct spelling?

2010-12-25 Thread Niels Basjes
Hi,

I noticed that the isSplitable method (and a bunch of other places in
the Hadoop codebase) writes "splitable" where I would have expected
"splittable" (two 't's instead of one).
All spelling functionality I have indicates the 'double t' version is correct.

Should I correct this in the junit test files I'm touching?

-- 
Best regards,

Niels Basjes


[jira] Created: (HADOOP-7076) Splittable Gzip

2010-12-23 Thread Niels Basjes (JIRA)
Splittable Gzip
---

 Key: HADOOP-7076
 URL: https://issues.apache.org/jira/browse/HADOOP-7076
 Project: Hadoop Common
  Issue Type: New Feature
  Components: io
Reporter: Niels Basjes


Files compressed with the gzip codec are not splittable due to the nature of 
the codec.
This limits the options you have for scaling out when reading large gzipped
input files.

Given the fact that gunzipping a 1GiB file usually takes only 2 minutes, I 
figured that for some use cases wasting some resources may result in a shorter 
job time.
So reading the entire input file from the start for each split (wasting 
resources!!) may lead to additional scalability.
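
To make the trade-off concrete, a purely illustrative sketch of the "read
and throw away" part. This is not code from the patch; it assumes, for
simplicity, that the number of decompressed bytes to discard before the
split is already known, while the real codec works with positions in the
compressed stream.

import java.io.IOException;
import java.io.InputStream;

// Illustration only: every split decompresses from the start of the file
// and discards everything before its own starting point. This is the
// deliberate waste of resources described above.
static void discardUpTo(InputStream decompressed, long bytesToDiscard)
    throws IOException {
  byte[] scratch = new byte[64 * 1024];
  long remaining = bytesToDiscard;
  while (remaining > 0) {
    int read = decompressed.read(scratch, 0,
        (int) Math.min(scratch.length, remaining));
    if (read < 0) {
      throw new IOException("Stream ended before the split start was reached");
    }
    remaining -= read;
  }
}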


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.