Re: HUDI-1232

2020-08-29 Thread Balaji Varadarajan
 Hi Selvaraj,
Yes, you are right. Sorry for the confusion. As mentioned in the release notes, 
a Spark 2.4.4 runtime is needed, although I don't remember what problem you will 
encounter with Spark 2.3.3. I think it would be a worthwhile exercise for you to 
upgrade to Spark 2.4.4 and the latest Hudi version, as we have been and are 
continuing to improve performance in Hudi :) For instance, the very next release 
will have consolidated metadata, which would avoid file listing in the first place. 
Thanks,
Balaji.V
On Saturday, August 29, 2020, 11:09:25 AM PDT, selvaraj 
periyasamy  wrote:  
 
 Thanks Balaji,

I am looking into the steps to upgrade to 0.6.0. I noticed the content
below in the 0.5.1 release notes at https://hudi.apache.org/releases.html.
It says the runtime Spark version must be 2.4+. I am a little confused now.
Could you shed more light on this?
Release Highlights


  - Dependency Version Upgrades
      - Upgrade from Spark 2.1.0 to Spark 2.4.4
      - Upgrade from Avro 1.7.7 to Avro 1.8.2
      - Upgrade from Parquet 1.8.1 to Parquet 1.10.1
      - Upgrade from Kafka 0.8.2.1 to Kafka 2.0.0 as a result of updating
      spark-streaming-kafka artifact from 0.8_2.11/2.12 to 0.10_2.11/2.12.
  - *IMPORTANT* This version requires your runtime spark version to be
  upgraded to 2.4+.

Thanks,
Selva

On Sat, Aug 29, 2020 at 1:16 AM Balaji Varadarajan
 wrote:

>  From the hudiLogs.txt, I find only HoodieROTablePathFilter-related logs
> repeating, which suggests this is the read side. So, we recommend using
> the latest version. I tried 2.3.3 and ran the quickstart without issues.
> Give it a shot and let us know if there are any issues.
> Balaji.V
>    On Friday, August 28, 2020, 04:42:51 PM PDT, selvaraj periyasamy <
> selvaraj.periyasamy1...@gmail.com> wrote:
>
>  Thanks Balaji. My Hadoop environment is still running Spark 2.3. Can I
> run 0.6.0 on Spark 2.3?
>
> For issue 1: I am able to manage it with a Spark glob read instead of a
> Hive read. I am good with this approach.
> Issue 2: I see the performance issue while writing into the COW table.
> This is purely a write, with no read involved. I attached the write logs
> (hudiLogs.txt) to the ticket. As my target gains more partitions, I
> am noticing a spike in write time.  The fix #1919 mentioned is applicable
> for writing as well.
>
> On Fri, Aug 28, 2020 at 3:28 PM vbal...@apache.org 
> wrote:
>
> >  Hi Selvaraj,
> > We fixed a relevant perf issue in 0.6.0 ([HUDI-1144] Speedup spark read
> > queries by caching metaclient in HoodieROPathFilter (#1919)). Can you
> > please try 0.6.0?
> > Balaji.V
> >    On Friday, August 28, 2020, 01:31:42 PM PDT, selvaraj periyasamy <
> > selvaraj.periyasamy1...@gmail.com> wrote:
> >
> >  I have created the ticket
> > https://issues.apache.org/jira/browse/HUDI-1232
> > for tracking a couple of issues.
> >
> > One of the concerns in my use case: I have a COW table named TRR. I see
> > the logs pasted below rolling for every individual partition, even though
> > my write touches only a couple of partitions, and it takes up to 4 to 5
> > mins. I pasted only a few of them. I am worried that in the future, when
> > I have 3 years' worth of data, writing will be very slow every time I
> > write into only a couple of partitions.
> >
> > 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
> > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> > type COPY_ON_WRITE from
> > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> > java.util.stream.ReferencePipeline$Head@fed0a8b
> > 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> > partition :20200714/01, #FileGroups=1
> > 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> > NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=1
> > 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
> > from base path:
> > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
> > files under
> > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/01
> > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
> > from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
> > [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
> > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> > yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
> > [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
> > ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
> > 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties 

Re: HUDI-1232

2020-08-29 Thread selvaraj periyasamy
Thanks Balaji,

I am looking into the steps to upgrade to 0.6.0. I noticed the content
below in the 0.5.1 release notes at https://hudi.apache.org/releases.html.
It says the runtime Spark version must be 2.4+. I am a little confused now.
Could you shed more light on this?
Release Highlights


  - Dependency Version Upgrades
      - Upgrade from Spark 2.1.0 to Spark 2.4.4
      - Upgrade from Avro 1.7.7 to Avro 1.8.2
      - Upgrade from Parquet 1.8.1 to Parquet 1.10.1
      - Upgrade from Kafka 0.8.2.1 to Kafka 2.0.0 as a result of updating
      spark-streaming-kafka artifact from 0.8_2.11/2.12 to 0.10_2.11/2.12.
  - *IMPORTANT* This version requires your runtime spark version to be
  upgraded to 2.4+.

Thanks,
Selva
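
For anyone checking their own environment against the 2.4+ requirement quoted
above, a minimal sketch of the comparison follows. meets_minimum is a
hypothetical helper, not a Hudi or Spark API; in PySpark the version string
would typically come from spark.version.

```python
def meets_minimum(version, minimum=(2, 4)):
    """Return True if a dotted version string such as '2.3.3' is >= minimum.

    Only the leading components are compared (major and minor by default),
    so a patch level or suffix after them is ignored.
    """
    parts = tuple(int(p) for p in version.split(".")[:len(minimum)])
    return parts >= minimum


# The two runtimes discussed in this thread.
for v in ("2.3.3", "2.4.4"):
    print(v, "meets 2.4+:", meets_minimum(v))
```

This only compares version numbers; it says nothing about which specific
incompatibility an older runtime would hit.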

On Sat, Aug 29, 2020 at 1:16 AM Balaji Varadarajan
 wrote:

>  From the hudiLogs.txt, I find only HoodieROTablePathFilter-related logs
> repeating, which suggests this is the read side. So, we recommend using
> the latest version. I tried 2.3.3 and ran the quickstart without issues.
> Give it a shot and let us know if there are any issues.
> Balaji.V
> On Friday, August 28, 2020, 04:42:51 PM PDT, selvaraj periyasamy <
> selvaraj.periyasamy1...@gmail.com> wrote:
>
>  Thanks Balaji. My Hadoop environment is still running Spark 2.3. Can I
> run 0.6.0 on Spark 2.3?
>
> For issue 1: I am able to manage it with a Spark glob read instead of a
> Hive read. I am good with this approach.
> Issue 2: I see the performance issue while writing into the COW table.
> This is purely a write, with no read involved. I attached the write logs
> (hudiLogs.txt) to the ticket. As my target gains more partitions, I
> am noticing a spike in write time.  The fix #1919 mentioned is applicable
> for writing as well.
>
> On Fri, Aug 28, 2020 at 3:28 PM vbal...@apache.org 
> wrote:
>
> >  Hi Selvaraj,
> > We fixed a relevant perf issue in 0.6.0 ([HUDI-1144] Speedup spark read
> > queries by caching metaclient in HoodieROPathFilter (#1919)). Can you
> > please try 0.6.0?
> > Balaji.V
> >    On Friday, August 28, 2020, 01:31:42 PM PDT, selvaraj periyasamy <
> > selvaraj.periyasamy1...@gmail.com> wrote:
> >
> >  I have created the ticket
> > https://issues.apache.org/jira/browse/HUDI-1232
> > for tracking a couple of issues.
> >
> > One of the concerns in my use case: I have a COW table named TRR. I see
> > the logs pasted below rolling for every individual partition, even though
> > my write touches only a couple of partitions, and it takes up to 4 to 5
> > mins. I pasted only a few of them. I am worried that in the future, when
> > I have 3 years' worth of data, writing will be very slow every time I
> > write into only a couple of partitions.
> >
> > 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
> > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> > type COPY_ON_WRITE from
> > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> > java.util.stream.ReferencePipeline$Head@fed0a8b
> > 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> > partition :20200714/01, #FileGroups=1
> > 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> > NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=1
> > 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
> > from base path:
> > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
> > files under
> > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/01
> > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
> > from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
> > [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
> > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> > yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
> > [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
> > ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
> > 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
> > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> > 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> > type COPY_ON_WRITE from
> > hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> > 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> > java.util.stream.ReferencePipeline$Head@285c67a9
> > 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> > partition :20200714/02, #FileGroups=1
> > 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> > 

Re: HUDI-1232

2020-08-29 Thread Balaji Varadarajan
 From the hudiLogs.txt, I find only HoodieROTablePathFilter-related logs 
repeating, which suggests this is the read side. So, we recommend using the 
latest version. I tried 2.3.3 and ran the quickstart without issues. Give it a 
shot and let us know if there are any issues.
Balaji.V
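
A rough way to repeat this kind of check on a log file, as a minimal sketch:
tally which logger class each line comes from, and collect the partitions
named in the "Adding file-groups for partition" lines. The function name
summarize and the two regular expressions are assumptions written against
the log lines pasted in this thread; they are not part of Hudi.

```python
import re
from collections import Counter

# Matches the start of lines such as:
# "20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata"
LOG_LINE = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (\w+) (\w+):")

# Matches "Adding file-groups for partition :20200714/01, #FileGroups=1"
PARTITION = re.compile(r"Adding file-groups for partition :(\S+),")


def summarize(lines):
    """Return (Counter of logger class names, set of partitions seen)."""
    loggers = Counter()
    partitions = set()
    for line in lines:
        header = LOG_LINE.match(line)
        if header:
            loggers[header.group(2)] += 1
        part = PARTITION.search(line)
        if part:
            partitions.add(part.group(1))
    return loggers, partitions
```

For example, summarize(open("hudiLogs.txt")) would show whether
HoodieROTablePathFilter lines dominate and how many distinct partitions were
listed, assuming each log record sits on a single line in the file.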
On Friday, August 28, 2020, 04:42:51 PM PDT, selvaraj periyasamy 
 wrote:  
 
 Thanks Balaji. My Hadoop environment is still running Spark 2.3. Can I
run 0.6.0 on Spark 2.3?

For issue 1: I am able to manage it with a Spark glob read instead of a
Hive read. I am good with this approach.
Issue 2: I see the performance issue while writing into the COW table.
This is purely a write, with no read involved. I attached the write logs
(hudiLogs.txt) to the ticket. As my target gains more partitions, I
am noticing a spike in write time.  The fix #1919 mentioned is applicable
for writing as well.

On Fri, Aug 28, 2020 at 3:28 PM vbal...@apache.org 
wrote:

>  Hi Selvaraj,
> We fixed a relevant perf issue in 0.6.0 ([HUDI-1144] Speedup spark read
> queries by caching metaclient in HoodieROPathFilter (#1919)). Can you
> please try 0.6.0?
> Balaji.V
>    On Friday, August 28, 2020, 01:31:42 PM PDT, selvaraj periyasamy <
> selvaraj.periyasamy1...@gmail.com> wrote:
>
>  I have created the ticket
> https://issues.apache.org/jira/browse/HUDI-1232
> for tracking a couple of issues.
>
> One of the concerns in my use case: I have a COW table named TRR. I see
> the logs pasted below rolling for every individual partition, even though
> my write touches only a couple of partitions, and it takes up to 4 to 5
> mins. I pasted only a few of them. I am worried that in the future, when
> I have 3 years' worth of data, writing will be very slow every time I
> write into only a couple of partitions.
>
> 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
>
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> type COPY_ON_WRITE from
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> java.util.stream.ReferencePipeline$Head@fed0a8b
> 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> partition :20200714/01, #FileGroups=1
> 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=1
> 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
> from base path:
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
> files under
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/01
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
> from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
> [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
> ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
> 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
>
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
> type COPY_ON_WRITE from
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
> java.util.stream.ReferencePipeline$Head@285c67a9
> 20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
> partition :20200714/02, #FileGroups=1
> 20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
> NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
> from base path:
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
> files under
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/02
> 20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
> from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
> 20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
> [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
> ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
> 20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
>
> hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
> 20/08/27 02:08:22 INFO