Thank you. Will try out these options.


With Best Regards,



On Wed, Jul 19, 2023 at 1:40 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Sounds like if the mv command is inherently slow, there is little that can
> be done.
>
> The only suggestion I can make is to create the staging table as
> compressed, to reduce its size and hence the cost of the mv. Is that
> feasible? The managed table can also be created with SNAPPY compression:
>
> STORED AS ORC
> TBLPROPERTIES (
> "orc.create.index"="true",
> "orc.bloom.filter.columns"="KEY",
> "orc.bloom.filter.fpp"="0.05",
> "orc.compress"="SNAPPY",
> "orc.stripe.size"="16777216",
> "orc.row.index.stride"="10000" )
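As a sketch of the compression suggestion above: the same Snappy ORC output can also be requested from the Spark side when writing the staging data, so the files being moved are smaller. The job script name is a placeholder.

```shell
# Ask Spark to emit Snappy-compressed ORC (smaller files, cheaper mv).
# "your_etl_job.py" is a placeholder for the actual job.
spark-submit \
  --conf spark.sql.orc.compression.codec=snappy \
  your_etl_job.py
```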
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 19 Jul 2023 at 02:35, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>
>> Hi Mich,
>> OK, my use case is a bit different.
>> I have a Hive table partitioned by date, and I need to do dynamic
>> partition updates (INSERT OVERWRITE) daily for the last 30 days of
>> partitions.
>> The ETL into the staging directories completes in barely 5 minutes, but
>> the rename then takes a long time, as it deletes and copies the
>> partitions.
>> My issue is something related to this -
>> https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1
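For reference, the dynamic-partition insert-overwrite pattern described above is usually driven by this Spark setting (available from Spark 2.3 onwards); the job name below is a placeholder:

```shell
# Overwrite only the partitions the job actually writes, rather than the
# whole table; "daily_partition_update.py" is a placeholder job name.
spark-submit \
  --conf spark.sql.sources.partitionOverwriteMode=dynamic \
  daily_partition_update.py
```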
>>
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>>
>>
>>
>> On Wed, Jul 19, 2023 at 12:06 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Spark has no role in creating that Hive staging directory. That
>>> directory belongs to Hive, and Spark simply does its ETL there, loading
>>> into the Hive managed table, which in your case ends up in the staging
>>> directory.
>>>
>>> I suggest that you review your design and use an external Hive table
>>> with an explicit location on GCS that includes the date the data was
>>> loaded. Then push that data into the Hive managed table for today's
>>> partition.
>>>
>>> This was written in bash for Hive HQL itself, but you can easily adapt
>>> it for Spark:
>>>
>>> TODAY="`date +%Y-%m-%d`"
>>> DateStamp="${TODAY}"
>>> CREATE EXTERNAL TABLE IF NOT EXISTS EXTERNALMARKETDATA (
>>>      KEY string
>>>    , TICKER string
>>>    , TIMECREATED string
>>>    , PRICE float
>>> )
>>> COMMENT 'From prices using Kafka delivered by Flume location by day'
>>> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>>> STORED AS TEXTFILE
>>> LOCATION 'gs://etcbucket/cloud_data_fusion/hive.../';
>>>
>>> --Keep track of daily ingestion into the external table.
>>> ALTER TABLE EXTERNALMARKETDATA set location
>>> 'gs://etcbucket/cloud_data_fusion/hive.../${TODAY}';
>>>
>>> -- create your managed table here and populate it from the Hive external
>>> table
>>> CREATE TABLE IF NOT EXISTS MARKETDATA (
>>>      KEY string
>>>    , TICKER string
>>>    , TIMECREATED string
>>>    , PRICE float
>>>    , op_type int
>>>    , op_time timestamp
>>> )
>>> PARTITIONED BY (DateStamp  string)
>>> STORED AS ORC
>>> TBLPROPERTIES (
>>> "orc.create.index"="true",
>>> "orc.bloom.filter.columns"="KEY",
>>> "orc.bloom.filter.fpp"="0.05",
>>> "orc.compress"="SNAPPY",
>>> "orc.stripe.size"="16777216",
>>> "orc.row.index.stride"="10000" )
>>> ;
>>>
>>> --Populate target table
>>> INSERT OVERWRITE TABLE MARKETDATA PARTITION (DateStamp = "${TODAY}")
>>> SELECT
>>>       KEY
>>>     , TICKER
>>>     , TIMECREATED
>>>     , PRICE
>>>     , 1
>>>     , CAST(from_unixtime(unix_timestamp()) AS timestamp)
>>> FROM EXTERNALMARKETDATA;
>>>
>>> ANALYZE TABLE MARKETDATA PARTITION (DateStamp) COMPUTE STATISTICS;
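One possible Spark adaptation of the script above, run through spark-sql; the bucket path and table names follow the placeholders already used in the HQL:

```shell
# Repoint the external table at today's folder, then load today's partition
# of the managed table - the same flow as the HQL above, driven by spark-sql.
TODAY="$(date +%Y-%m-%d)"
spark-sql -e "
ALTER TABLE EXTERNALMARKETDATA SET LOCATION 'gs://etcbucket/cloud_data_fusion/hive.../${TODAY}';
INSERT OVERWRITE TABLE MARKETDATA PARTITION (DateStamp = '${TODAY}')
SELECT KEY, TICKER, TIMECREATED, PRICE, 1,
       CAST(from_unixtime(unix_timestamp()) AS timestamp)
FROM EXTERNALMARKETDATA;"
```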
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 18 Jul 2023 at 18:22, Dipayan Dev <dev.dipaya...@gmail.com>
>>> wrote:
>>>
>>>> It does help performance, but not significantly.
>>>>
>>>> I am just wondering: once Spark creates that staging directory along
>>>> with the _SUCCESS file, can we just run a gsutil rsync command to move
>>>> these files to the original directory? Has anyone tried this approach,
>>>> or does anyone foresee any concerns?
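A hypothetical sketch of that rsync idea, with placeholder paths; whether it is safe depends on no other writer touching the target while it runs, and it bypasses the committer's own guarantees:

```shell
# Copy staging output to the final partition path only once the job's
# _SUCCESS marker exists. Both paths below are illustrative placeholders.
STAGING="gs://my-bucket/warehouse/table/.hive-staging/-ext-10000"
TARGET="gs://my-bucket/warehouse/table/datestamp=2023-07-19"
if gsutil -q stat "${STAGING}/_SUCCESS"; then
  gsutil -m rsync -r "${STAGING}" "${TARGET}"
fi
```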
>>>>
>>>>
>>>>
>>>> On Mon, 17 Jul 2023 at 9:47 PM, Dipayan Dev <dev.dipaya...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Jay, is there any suggestion on how much I can increase those
>>>>> parameters?
>>>>>
>>>>> On Mon, 17 Jul 2023 at 8:25 PM, Jay <jayadeep.jayara...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> FileOutputCommitter v2 is supported on GCS, but the rename is a
>>>>>> metadata copy-and-delete operation in GCS, so if there are a large
>>>>>> number of files this step will take a long time. One workaround is to
>>>>>> create a smaller number of larger files, if that is possible from
>>>>>> Spark; if not, those configurations allow tuning the thread pool that
>>>>>> performs the metadata copy.
>>>>>>
>>>>>> You can go through this table
>>>>>> <https://spark.apache.org/docs/latest/cloud-integration.html#recommended-settings-for-writing-to-object-stores>
>>>>>> to understand GCS performance implications.
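The "fewer, larger files" workaround can be sketched like this; the partition count is purely illustrative and should be sized to the data:

```shell
# Fewer output files per partition means fewer GCS rename operations at
# commit time; 64 is an illustrative value, not a recommendation.
spark-submit \
  --conf spark.sql.shuffle.partitions=64 \
  etl_job.py
```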
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 17 Jul 2023 at 20:12, Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> You said this Hive table was a managed table partitioned by date
>>>>>>> -->${TODAY}
>>>>>>>
>>>>>>> How  do you define your Hive managed table?
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Solutions Architect/Engineering Lead
>>>>>>> Palantir Technologies Limited
>>>>>>> London
>>>>>>> United Kingdom
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 17 Jul 2023 at 15:29, Dipayan Dev <dev.dipaya...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> It does support it - it doesn't error out for me, at least. But it
>>>>>>>> took around 4 hours to finish the job.
>>>>>>>>
>>>>>>>> Interestingly, it took only 10 minutes to write the output to the
>>>>>>>> staging directory; the rest of the time went into renaming the
>>>>>>>> objects. That's the concern.
>>>>>>>>
>>>>>>>> It looks like a known issue with how Spark behaves on GCS, but I am
>>>>>>>> not finding any workaround for it.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park <yeachan...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Did you check if mapreduce.fileoutputcommitter.algorithm.version 2
>>>>>>>>> is supported on GCS? IIRC it wasn't, but you could check with GCP 
>>>>>>>>> support
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev <
>>>>>>>>> dev.dipaya...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Jay,
>>>>>>>>>>
>>>>>>>>>> I will try that option.
>>>>>>>>>>
>>>>>>>>>> Any insight on the file committer algorithms?
>>>>>>>>>>
>>>>>>>>>> I tried the v2 algorithm, but it's not improving the runtime.
>>>>>>>>>> What's the best practice in Dataproc for dynamic updates in Spark?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, 17 Jul 2023 at 7:05 PM, Jay <jayadeep.jayara...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> You can try increasing fs.gs.batch.threads and
>>>>>>>>>>> fs.gs.max.requests.per.batch.
>>>>>>>>>>>
>>>>>>>>>>> The definitions for these flags are available here -
>>>>>>>>>>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
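Those connector settings reach Spark through the spark.hadoop. prefix; the values below are illustrative starting points, not tuned recommendations:

```shell
# Raise the GCS connector's batch parallelism for the metadata operations
# (copy/delete) performed during rename. Values are illustrative only.
spark-submit \
  --conf spark.hadoop.fs.gs.batch.threads=30 \
  --conf spark.hadoop.fs.gs.max.requests.per.batch=30 \
  job.py
```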
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <
>>>>>>>>>>> dev.dipaya...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> No, I am using Spark 2.4 to update the GCS partitions. I have
>>>>>>>>>>>> a managed Hive table on top of this.
>>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>> When I do a dynamic partition update from Spark, it creates the
>>>>>>>>>>>> new files in a staging area, as shown here.
>>>>>>>>>>>> But the GCS blob renaming takes a lot of time. I have a
>>>>>>>>>>>> partition based on dates, and I need to update around 3 years of
>>>>>>>>>>>> data. It usually takes 3 hours to finish the process. Any way to
>>>>>>>>>>>> speed this up?
>>>>>>>>>>>> With Best Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Dipayan Dev
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> So you are using GCP and your Hive is installed on Dataproc
>>>>>>>>>>>>> which happens to run your Spark as well. Is that correct?
>>>>>>>>>>>>>
>>>>>>>>>>>>> What version of Hive are you using?
>>>>>>>>>>>>>
>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>>>>> Solutions Architect/Engineering Lead
>>>>>>>>>>>>> Palantir Technologies Limited
>>>>>>>>>>>>> London
>>>>>>>>>>>>> United Kingdom
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <
>>>>>>>>>>>>> dev.dipaya...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Of late, I have encountered an issue where I have to
>>>>>>>>>>>>>> overwrite a lot of partitions of a Hive table through Spark.
>>>>>>>>>>>>>> It looks like writing to the hive_staging_directory takes 25%
>>>>>>>>>>>>>> of the total time, whereas 75% or more of the time goes into
>>>>>>>>>>>>>> moving the ORC files from the staging directory to the final
>>>>>>>>>>>>>> partitioned directory structure.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I found some references suggesting the use of this config
>>>>>>>>>>>>>> during the Spark write:
>>>>>>>>>>>>>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
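For completeness, that setting is passed to Spark as a Hadoop configuration, along these lines (the job name is a placeholder):

```shell
# FileOutputCommitter algorithm v2: tasks rename their output directly into
# the destination, avoiding the final job-level move, at the cost of
# possibly leaving partial data behind if the job fails midway.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  job.py
```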
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, it is also mentioned that it is not safe, as a
>>>>>>>>>>>>>> partial job failure might cause data loss.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there any suggestion on the pros and cons of using this
>>>>>>>>>>>>>> version? Or any ongoing Spark feature development to address 
>>>>>>>>>>>>>> this issue?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With Best Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dipayan Dev
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> With Best Regards,
>>>>>>>>>>
>>>>>>>>>> Dipayan Dev
>>>>>>>>>> Author of *Deep Learning with Hadoop
>>>>>>>>>> <https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
>>>>>>>>>> M.Tech (AI), IISc, Bangalore
>>>>>>>>>>
>>>
