Thank you for your reply! I have already made the change locally, so changing it is fine on my end. I just wanted to be sure which way is correct.

On 9 Dec 2015 18:20, "Fengdong Yu" <fengdo...@everstring.com> wrote:

> I don't think there is a performance difference between the 1.x API and the 2.x API.
>
> But it's not a big issue for your change; only
> com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java
> <https://github.com/databricks/spark-xml/blob/master/src/main/java/com/databricks/hadoop/mapreduce/lib/input/XmlInputFormat.java>
> needs to change, right?
>
> It's not a big change to the 2.x API. If you agree, I can do it, but I cannot
> promise to finish within one or two weeks because of my daily job.
>
>
> On Dec 9, 2015, at 5:01 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Hi all,
>
> I am writing this email to both the user group and the dev group since it is
> applicable to both.
>
> I am now working on the Spark XML datasource (
> https://github.com/databricks/spark-xml).
> It uses an InputFormat implementation which I downgraded to the Hadoop 1.x
> API for version compatibility.
>
> However, I found that the internal JSON datasource and others at Databricks
> use the Hadoop 2.x API, instantiating TaskAttemptContextImpl by reflection,
> because TaskAttemptContext is a class in Hadoop 1.x but an interface in
> Hadoop 2.x.
>
> So, I looked through the code for advantages of the Hadoop 2.x API, but
> I couldn't find any.
> I wonder if there are advantages to using the Hadoop 2.x API.
>
> I understand that it is still preferable to use the Hadoop 2.x API, at least
> because of future differences between the versions, but somehow I feel it
> should not be necessary to reach for Hadoop 2.x via reflection.
>
> I would appreciate it if you could leave a comment at
> https://github.com/databricks/spark-xml/pull/14 as well as send back a
> reply if there is a good explanation.
>
> Thanks!
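For reference, the reflection approach mentioned above looks roughly like the sketch below. The Hadoop class names are real, but the TaskAttemptContextFactory helper is a hypothetical name of mine, not actual spark-xml or Spark code; it only illustrates the technique of picking the right concrete class at runtime.

import java.lang.reflect.Constructor;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;

// Hypothetical helper: builds a TaskAttemptContext regardless of whether
// the Hadoop version on the classpath defines it as a class (1.x) or an
// interface (2.x).
public final class TaskAttemptContextFactory {

  public static TaskAttemptContext create(Configuration conf, TaskAttemptID id)
      throws Exception {
    Class<?> impl;
    try {
      // Hadoop 2.x: TaskAttemptContext is an interface; the concrete
      // implementation lives in the o.a.h.mapreduce.task package.
      impl = Class.forName("org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl");
    } catch (ClassNotFoundException e) {
      // Hadoop 1.x: TaskAttemptContext itself is a concrete class.
      impl = Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext");
    }
    // Both variants expose a (Configuration, TaskAttemptID) constructor.
    Constructor<?> ctor =
        impl.getDeclaredConstructor(Configuration.class, TaskAttemptID.class);
    return (TaskAttemptContext) ctor.newInstance(conf, id);
  }

  private TaskAttemptContextFactory() {}
}

Compiled once, this runs against either Hadoop line, which is the advantage being weighed against simply writing to the 1.x API directly.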