I used the following config and the performance has improved a lot:

.config("spark.sql.orc.splits.include.file.footer", true)
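For later readers: the same setting can also be passed at submit time rather than in code. This is a hedged sketch only; the job script name is a placeholder, and the default value of this property is not confirmed anywhere in this thread.

```shell
# Sketch: passing the ORC footer-split setting discussed above via spark-submit.
# "your_job.py" is a placeholder for the actual application.
spark-submit \
  --conf spark.sql.orc.splits.include.file.footer=true \
  your_job.py
```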
I am not able to find the default value of this config documented anywhere. Can someone please share what the default is? Is it false? Also, just curious, what does this config actually do?

With Best Regards,
Dipayan Dev

On Wed, Jul 19, 2023 at 2:25 PM Dipayan Dev <dev.dipaya...@gmail.com> wrote:

> Thank you. Will try out these options.
>
> With Best Regards,
>
> On Wed, Jul 19, 2023 at 1:40 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Sounds like if the mv command is inherently slow, there is little that can be done.
>>
>> The only suggestion I can make is to create the staging table as compressed to reduce its size and hence speed up the mv. Is that feasible? Also, the managed table can be created with SNAPPY compression:
>>
>> STORED AS ORC
>> TBLPROPERTIES (
>>   "orc.create.index"="true",
>>   "orc.bloom.filter.columns"="KEY",
>>   "orc.bloom.filter.fpp"="0.05",
>>   "orc.compress"="SNAPPY",
>>   "orc.stripe.size"="16777216",
>>   "orc.row.index.stride"="10000" )
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>
>> On Wed, 19 Jul 2023 at 02:35, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>
>>> Hi Mich,
>>>
>>> OK, my use case is a bit different.
>>> I have a Hive table partitioned by dates, and I need to do dynamic partition updates (insert overwrite) daily for the last 30 days of partitions.
>>> The ETL into the staging directories completes in barely 5 minutes, but the renaming then takes a long time, because it deletes and copies the partitions.
>>> My issue is something related to this:
>>> https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1
>>>
>>> With Best Regards,
>>> Dipayan Dev
>>>
>>> On Wed, Jul 19, 2023 at 12:06 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Spark has no role in creating that Hive staging directory. That directory belongs to Hive, and Spark simply does ETL there, loading to the Hive managed table in your case, which ends up in the staging directory.
>>>>
>>>> I suggest that you review your design and use an external Hive table with an explicit location on GCS that includes the date the data was loaded. Then push that data into the Hive managed table for today's partition.
>>>>
>>>> This was written in bash for Hive HQL itself, but you can easily adapt it for Spark:
>>>>
>>>> TODAY="`date +%Y-%m-%d`"
>>>> DateStamp="${TODAY}"
>>>>
>>>> CREATE EXTERNAL TABLE IF NOT EXISTS EXTERNALMARKETDATA (
>>>>     KEY string
>>>>   , TICKER string
>>>>   , TIMECREATED string
>>>>   , PRICE float
>>>> )
>>>> COMMENT 'From prices using Kafka delivered by Flume, location by day'
>>>> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>>>> STORED AS TEXTFILE
>>>> LOCATION 'gs://etcbucket/cloud_data_fusion/hive.../';
>>>>
>>>> -- Keep track of daily ingestion into the external table.
>>>> ALTER TABLE EXTERNALMARKETDATA SET LOCATION
>>>> 'gs://etcbucket/cloud_data_fusion/hive.../${TODAY}';
>>>>
>>>> -- Create your managed table here and populate it from the Hive external table.
>>>> CREATE TABLE IF NOT EXISTS MARKETDATA (
>>>>     KEY string
>>>>   , TICKER string
>>>>   , TIMECREATED string
>>>>   , PRICE float
>>>>   , op_type int
>>>>   , op_time timestamp
>>>> )
>>>> PARTITIONED BY (DateStamp string)
>>>> STORED AS ORC
>>>> TBLPROPERTIES (
>>>>   "orc.create.index"="true",
>>>>   "orc.bloom.filter.columns"="KEY",
>>>>   "orc.bloom.filter.fpp"="0.05",
>>>>   "orc.compress"="SNAPPY",
>>>>   "orc.stripe.size"="16777216",
>>>>   "orc.row.index.stride"="10000" )
>>>> ;
>>>>
>>>> -- Populate the target table.
>>>> INSERT OVERWRITE TABLE MARKETDATA PARTITION (DateStamp = "${TODAY}")
>>>> SELECT
>>>>     KEY
>>>>   , TICKER
>>>>   , TIMECREATED
>>>>   , PRICE
>>>>   , 1
>>>>   , CAST(from_unixtime(unix_timestamp()) AS timestamp)
>>>> FROM EXTERNALMARKETDATA;
>>>>
>>>> ANALYZE TABLE MARKETDATA PARTITION (DateStamp) COMPUTE STATISTICS;
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>>
>>>> On Tue, 18 Jul 2023 at 18:22, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>>
>>>>> It does help performance, but not significantly.
>>>>> I am just wondering: once Spark creates that staging directory along with the SUCCESS file, can we just run a gsutil rsync command and move those files to the original directory? Has anyone tried this approach, or does anyone foresee any concern?
>>>>>
>>>>> On Mon, 17 Jul 2023 at 9:47 PM, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Jay. Is there any suggestion on how much I can increase those parameters?
>>>>>>
>>>>>> On Mon, 17 Jul 2023 at 8:25 PM, Jay <jayadeep.jayara...@gmail.com> wrote:
>>>>>>
>>>>>>> FileOutputCommitter v2 is supported on GCS, but the rename is a metadata copy-and-delete operation in GCS, so if there are many files this step will take a long time. One workaround is to create a smaller number of larger files, if that is possible from Spark; if not, those configurations allow tuning the thread pool that performs the metadata copy.
>>>>>>>
>>>>>>> You can go through this table
>>>>>>> <https://spark.apache.org/docs/latest/cloud-integration.html#recommended-settings-for-writing-to-object-stores>
>>>>>>> to understand the GCS performance implications.
>>>>>>>
>>>>>>> On Mon, 17 Jul 2023 at 20:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> You said this Hive table was a managed table partitioned by date --> ${TODAY}
>>>>>>>>
>>>>>>>> How do you define your Hive managed table?
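As an aside for readers of the archive: the gsutil rsync idea raised a few messages up could look roughly like the sketch below. The bucket and directory names are invented placeholders, and checking the _SUCCESS marker first is an assumption about how one would guard against syncing a failed job's output.

```shell
# Hypothetical staging and target paths; substitute real ones.
STAGING="gs://my-bucket/warehouse/.hive-staging_hive_2023-07-19_xyz"
TARGET="gs://my-bucket/warehouse/mytable/date=2023-07-19"

# Only sync if the job committed successfully (the _SUCCESS marker exists).
if gsutil -q stat "${STAGING}/_SUCCESS"; then
  # -m parallelizes transfers; -r recurses into subdirectories.
  gsutil -m rsync -r "${STAGING}" "${TARGET}"
fi
```

Note that rsync sidesteps the committer's rename step, but it also sidesteps the committer's atomicity, so the _SUCCESS check above is the only safeguard in this sketch.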
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Mich Talebzadeh,
>>>>>>>>
>>>>>>>> On Mon, 17 Jul 2023 at 15:29, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> It does support it; at least it doesn't error out for me. But the job took around 4 hours to finish.
>>>>>>>>>
>>>>>>>>> Interestingly, it took only 10 minutes to write the output to the staging directory; the rest of the time went into renaming the objects. That's the concern.
>>>>>>>>>
>>>>>>>>> This looks like a known issue with how Spark behaves on GCS, but I am not finding any workaround for it.
>>>>>>>>>
>>>>>>>>> On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park <yeachan...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is supported on GCS? IIRC it wasn't, but you could check with GCP support.
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Jay,
>>>>>>>>>>>
>>>>>>>>>>> I will try that option.
>>>>>>>>>>>
>>>>>>>>>>> Any insight on the file committer algorithms?
>>>>>>>>>>> I tried the v2 algorithm, but it is not improving the runtime. What is the best practice in Dataproc for dynamic partition updates in Spark?
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 17 Jul 2023 at 7:05 PM, Jay <jayadeep.jayara...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch.
>>>>>>>>>>>>
>>>>>>>>>>>> The definitions for these flags are available here:
>>>>>>>>>>>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> No, I am using Spark 2.4 to update the GCS partitions. I have a managed Hive table on top of this.
>>>>>>>>>>>>> When I do a dynamic partition update from Spark, it creates the new files in a staging area, as shown in the attached image. But the GCS blob renaming takes a lot of time. I have partitions based on dates, and I need to update around 3 years of data. It usually takes 3 hours to finish the process. Is there any way to speed this up?
>>>>>>>>>>>>>
>>>>>>>>>>>>> With Best Regards,
>>>>>>>>>>>>> Dipayan Dev
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So you are using GCP, and your Hive is installed on Dataproc, which happens to run your Spark as well. Is that correct?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What version of Hive are you using?
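The fs.gs.* flags Jay mentions are GCS connector (Hadoop filesystem) options, so from Spark they are typically supplied through the spark.hadoop. prefix. A hedged sketch; the values are illustrative rather than recommendations, and the job script name is a placeholder.

```shell
# Hadoop FS options reach the GCS connector via the spark.hadoop. prefix.
# The thread counts below are examples only; tune against your workload.
spark-submit \
  --conf spark.hadoop.fs.gs.batch.threads=32 \
  --conf spark.hadoop.fs.gs.max.requests.per.batch=64 \
  your_job.py
```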
>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Of late, I have encountered an issue where I have to overwrite a lot of partitions of a Hive table through Spark. It looks like writing to the hive_staging_directory takes 25% of the total time, whereas 75% or more of the time goes into moving the ORC files from the staging directory to the final partitioned directory structure.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I found some references suggesting this config during the Spark write:
>>>>>>>>>>>>>>> mapreduce.fileoutputcommitter.algorithm.version = 2
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, it is also mentioned that it is not safe, as a partial job failure might cause data loss.
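The committer setting quoted above is a Hadoop property rather than a Spark SQL one, so it too takes the spark.hadoop. prefix when set from Spark. A sketch only, with the job script as a placeholder; as the message says, v2 is not atomic, and a partial job failure can leave incomplete output behind.

```shell
# Enable FileOutputCommitter v2: faster task commit (no final rename pass),
# but weaker failure guarantees than the default v1 algorithm.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  your_job.py
```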
>>>>>>>>>>>>>>> Is there any suggestion on the pros and cons of using this version? Or is there any ongoing Spark feature development to address this issue?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With Best Regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dipayan Dev
>>>>>>>>>>>>>>> Author of *Deep Learning with Hadoop
>>>>>>>>>>>>>>> <https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
>>>>>>>>>>>>>>> M.Tech (AI), IISc, Bangalore