It sounds as if the mv command is inherently slow, so there is little that can
be done about that step itself.

The only suggestion I can make is to create the staging table as compressed,
to reduce its size and hence the cost of the mv. Is that feasible? The managed
table can also be created with SNAPPY compression:

STORED AS ORC
TBLPROPERTIES (
"orc.create.index"="true",
"orc.bloom.filter.columns"="KEY",
"orc.bloom.filter.fpp"="0.05",
"*orc.compress"="SNAPPY",*
"orc.stripe.size"="16777216",
"orc.row.index.stride"="10000" )

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 19 Jul 2023 at 02:35, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

> Hi Mich,
> Ok, my use-case is a bit different.
> I have a Hive table partitioned by dates and need to do dynamic partition
> updates (insert overwrite) daily for the last 30 days of partitions.
> The ETL into the staging directories completes in hardly 5 minutes, but the
> renaming then takes a long time, as it deletes and copies the
> partitions.
> My issue is something related to this -
> https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1
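>
> For context, a minimal PySpark sketch of that kind of daily run using Spark's
> dynamic partition overwrite mode (available since Spark 2.3); the table names
> below are placeholders, not the real ones:
>
> from pyspark.sql import SparkSession
>
> # Sketch only: overwrite just the partitions present in the incoming data,
> # rather than the whole table.
> spark = (SparkSession.builder
>          .appName("daily-partition-overwrite")
>          .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
>          .enableHiveSupport()
>          .getOrCreate())
>
> # Placeholder source holding the last 30 days of recomputed data.
> updates_df = spark.table("staging_db.daily_updates")
>
> # insertInto uses the target table's existing partitioning (by date), so only
> # the touched date partitions are overwritten.
> updates_df.write.mode("overwrite").insertInto("warehouse_db.target_table")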
>
>
>
> With Best Regards,
>
> Dipayan Dev
>
>
>
> On Wed, Jul 19, 2023 at 12:06 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Spark has no role in creating that Hive staging directory. That directory
>> belongs to Hive; Spark simply does its ETL there, and in your case the load
>> into the Hive managed table passes through the staging directory first.
>>
>> I suggest that you review your design and use an external Hive table with an
>> explicit location on GCS that includes the date the data was loaded. Then push
>> that data into the Hive managed table for today's partition.
>>
>> This was written in bash for Hive HQL itself, but you can easily adapt it for
>> Spark; a rough sketch of such an adaptation follows the script.
>>
>> TODAY="`date +%Y-%m-%d`"
>> DateStamp="${TODAY}"
>> CREATE EXTERNAL TABLE IF NOT EXISTS EXTERNALMARKETDATA (
>>      KEY string
>>    , TICKER string
>>    , TIMECREATED string
>>    , PRICE float
>> )
>> COMMENT 'From prices using Kafka delivered by Flume location by day'
>> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>> STORED AS TEXTFILE
>> LOCATION 'gs://etcbucket/cloud_data_fusion/hive.../';
>>
>> --Keep track of daily ingestion into the external table.
>> ALTER TABLE EXTERNALMARKETDATA set location
>> 'gs://etcbucket/cloud_data_fusion/hive.../${TODAY}';
>>
>> -- create your managed table here and populate it from the Hive external
>> table
>> CREATE TABLE IF NOT EXISTS MARKETDATA (
>>      KEY string
>>    , TICKER string
>>    , TIMECREATED string
>>    , PRICE float
>>    , op_type int
>>    , op_time timestamp
>> )
>> PARTITIONED BY (DateStamp  string)
>> STORED AS ORC
>> TBLPROPERTIES (
>> "orc.create.index"="true",
>> "orc.bloom.filter.columns"="KEY",
>> "orc.bloom.filter.fpp"="0.05",
>> "orc.compress"="SNAPPY",
>> "orc.stripe.size"="16777216",
>> "orc.row.index.stride"="10000" )
>> ;
>>
>> --Populate target table
>> INSERT OVERWRITE TABLE MARKETDATA PARTITION (DateStamp = "${TODAY}")
>> SELECT
>>       KEY
>>     , TICKER
>>     , TIMECREATED
>>     , PRICE
>>     , 1
>>     , CAST(from_unixtime(unix_timestamp()) AS timestamp)
>> FROM EXTERNALMARKETDATA;
>>
>> ANALYZE TABLE MARKETDATA PARTITION (DateStamp) COMPUTE STATISTICS;
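>>
>> A rough PySpark sketch of the same flow, reusing the table names and location
>> from the script above (everything else is illustrative):
>>
>> from datetime import date
>> from pyspark.sql import SparkSession
>>
>> spark = (SparkSession.builder
>>          .appName("marketdata-load")
>>          .enableHiveSupport()
>>          .getOrCreate())
>>
>> today = date.today().isoformat()  # equivalent of TODAY in the bash snippet
>>
>> # Point the external table at today's landing directory.
>> spark.sql(f"ALTER TABLE EXTERNALMARKETDATA SET LOCATION "
>>           f"'gs://etcbucket/cloud_data_fusion/hive.../{today}'")
>>
>> # Load today's partition of the managed table from the external table.
>> spark.sql(f"""
>>     INSERT OVERWRITE TABLE MARKETDATA PARTITION (DateStamp = '{today}')
>>     SELECT KEY, TICKER, TIMECREATED, PRICE, 1,
>>            CAST(from_unixtime(unix_timestamp()) AS timestamp)
>>     FROM EXTERNALMARKETDATA
>> """)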
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 18 Jul 2023 at 18:22, Dipayan Dev <dev.dipaya...@gmail.com>
>> wrote:
>>
>>> It does help performance but not significantly.
>>>
>>> I am just wondering: once Spark creates that staging directory along with
>>> the SUCCESS file, can we just run a gsutil rsync command to move those files
>>> to the original directory? Has anyone tried this approach, or does anyone
>>> foresee any concerns?
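>>>
>>> For what it's worth, a minimal sketch of what that manual sync step could
>>> look like, run from the driver once the write has committed (the paths are
>>> placeholders, and whether this is safe with respect to the Hive metastore is
>>> exactly the open question):
>>>
>>> import subprocess
>>>
>>> # Sketch: copy the committed staging output into the final table location
>>> # with gsutil's parallel rsync instead of relying on per-object renames.
>>> staging_path = "gs://your-bucket/warehouse/table/.hive-staging_x/-ext-10000"  # placeholder
>>> final_path = "gs://your-bucket/warehouse/table"                               # placeholder
>>>
>>> subprocess.run(["gsutil", "-m", "rsync", "-r", staging_path, final_path],
>>>                check=True)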
>>>
>>>
>>>
>>> On Mon, 17 Jul 2023 at 9:47 PM, Dipayan Dev <dev.dipaya...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Jay, is there any suggestion on how much I can increase those
>>>> parameters?
>>>>
>>>> On Mon, 17 Jul 2023 at 8:25 PM, Jay <jayadeep.jayara...@gmail.com>
>>>> wrote:
>>>>
>>>>> FileOutputCommitter v2 is supported on GCS, but a rename there is a
>>>>> metadata copy-and-delete operation, so if there are many files this step
>>>>> will take a long time. One workaround is to create a smaller number of
>>>>> larger files, if that is possible from Spark; if it is not, those
>>>>> configurations let you tune the thread pool that performs the metadata
>>>>> copy.
>>>>>
>>>>> You can go through this table
>>>>> <https://spark.apache.org/docs/latest/cloud-integration.html#recommended-settings-for-writing-to-object-stores>
>>>>> to understand GCS performance implications.
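>>>>>
>>>>> As a small illustration of the "fewer, larger files" idea (a sketch only;
>>>>> the DataFrame and output path are placeholders, and the right file count
>>>>> depends on your data volume):
>>>>>
>>>>> # Assume staging_df is the DataFrame your ETL produced.
>>>>> # Reduce the number of output files so the commit/rename phase has fewer
>>>>> # objects to move. coalesce() avoids a full shuffle; repartition() would
>>>>> # rebalance more evenly at the cost of one.
>>>>> fewer_files_df = staging_df.coalesce(32)  # 32 is an arbitrary example
>>>>>
>>>>> (fewer_files_df.write
>>>>>     .format("orc")
>>>>>     .mode("overwrite")
>>>>>     .save("gs://your-bucket/staging/output"))  # placeholder path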
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 17 Jul 2023 at 20:12, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> You said this Hive table was a managed table partitioned by date
>>>>>> -->${TODAY}
>>>>>>
>>>>>> How  do you define your Hive managed table?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Solutions Architect/Engineering Lead
>>>>>> Palantir Technologies Limited
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>>
>>>>>>    view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 17 Jul 2023 at 15:29, Dipayan Dev <dev.dipaya...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> It does seem to be supported; it doesn't error out for me, at least. But
>>>>>>> the job took around 4 hours to finish.
>>>>>>>
>>>>>>> Interestingly, it took only 10 minutes to write the output to the
>>>>>>> staging directory, and the rest of the time went into renaming the
>>>>>>> objects. That's the concern.
>>>>>>>
>>>>>>> This looks like a known issue with how Spark behaves with GCS, but I am
>>>>>>> not finding any workaround for it.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park <yeachan...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Did you check if mapreduce.fileoutputcommitter.algorithm.version 2
>>>>>>>> is supported on GCS? IIRC it wasn't, but you could check with GCP 
>>>>>>>> support
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev <
>>>>>>>> dev.dipaya...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Jay,
>>>>>>>>>
>>>>>>>>> I will try that option.
>>>>>>>>>
>>>>>>>>> Any insight on the file committer algorithms?
>>>>>>>>>
>>>>>>>>> I tried the v2 algorithm but it's not improving the runtime. What's
>>>>>>>>> the best practice in Dataproc for dynamic updates in Spark?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, 17 Jul 2023 at 7:05 PM, Jay <jayadeep.jayara...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> You can try increasing fs.gs.batch.threads and
>>>>>>>>>> fs.gs.max.requests.per.batch.
>>>>>>>>>>
>>>>>>>>>> The definitions for these flags are available here -
>>>>>>>>>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
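>>>>>>>>>>
>>>>>>>>>> From a Spark job, one way to pass them is via the spark.hadoop.*
>>>>>>>>>> prefix, which forwards properties to the Hadoop configuration used
>>>>>>>>>> by the GCS connector. A sketch, with purely illustrative values:
>>>>>>>>>>
>>>>>>>>>> from pyspark.sql import SparkSession
>>>>>>>>>>
>>>>>>>>>> # Sketch: forward GCS connector settings through Spark's Hadoop config.
>>>>>>>>>> spark = (SparkSession.builder
>>>>>>>>>>          .appName("gcs-batch-tuning")
>>>>>>>>>>          .config("spark.hadoop.fs.gs.batch.threads", "32")           # example value
>>>>>>>>>>          .config("spark.hadoop.fs.gs.max.requests.per.batch", "30")  # example value
>>>>>>>>>>          .getOrCreate())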
>>>>>>>>>>
>>>>>>>>>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <
>>>>>>>>>> dev.dipaya...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> No, I am using Spark 2.4 to update the GCS partitions. I have a
>>>>>>>>>>> managed Hive table on top of this.
>>>>>>>>>>> [image: image.png]
>>>>>>>>>>> When I do a dynamic partition update from Spark, it creates the
>>>>>>>>>>> new files in a staging area, as shown here.
>>>>>>>>>>> But the GCS blob renaming takes a lot of time. I have partitions
>>>>>>>>>>> based on dates and need to update around 3 years of data. It
>>>>>>>>>>> usually takes 3 hours to finish the process. Is there any way to
>>>>>>>>>>> speed this up?
>>>>>>>>>>> With Best Regards,
>>>>>>>>>>>
>>>>>>>>>>> Dipayan Dev
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> So you are using GCP and your Hive is installed on Dataproc
>>>>>>>>>>>> which happens to run your Spark as well. Is that correct?
>>>>>>>>>>>>
>>>>>>>>>>>> What version of Hive are you using?
>>>>>>>>>>>>
>>>>>>>>>>>> HTH
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>>>> Solutions Architect/Engineering Lead
>>>>>>>>>>>> Palantir Technologies Limited
>>>>>>>>>>>> London
>>>>>>>>>>>> United Kingdom
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    view my Linkedin profile
>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all
>>>>>>>>>>>> responsibility for any loss, damage or destruction of data or any 
>>>>>>>>>>>> other
>>>>>>>>>>>> property which may arise from relying on this email's technical 
>>>>>>>>>>>> content is
>>>>>>>>>>>> explicitly disclaimed. The author will in no case be liable for any
>>>>>>>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <
>>>>>>>>>>>> dev.dipaya...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Of late, I have encountered an issue where I have to overwrite a
>>>>>>>>>>>>> lot of partitions of a Hive table through Spark. It looks like
>>>>>>>>>>>>> writing to the hive_staging_directory takes 25% of the total time,
>>>>>>>>>>>>> whereas 75% or more of the time goes into moving the ORC files
>>>>>>>>>>>>> from the staging directory to the final partitioned directory
>>>>>>>>>>>>> structure.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I found some references suggesting the use of this config during
>>>>>>>>>>>>> the Spark write:
>>>>>>>>>>>>> mapreduce.fileoutputcommitter.algorithm.version = 2
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, it's also mentioned that it's not safe, as a partial job
>>>>>>>>>>>>> failure might cause data loss.
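>>>>>>>>>>>>>
>>>>>>>>>>>>> For reference, a minimal sketch of how that property is usually
>>>>>>>>>>>>> passed to a Spark job, via the spark.hadoop.* prefix:
>>>>>>>>>>>>>
>>>>>>>>>>>>> from pyspark.sql import SparkSession
>>>>>>>>>>>>>
>>>>>>>>>>>>> # Sketch: enable FileOutputCommitter algorithm v2 for this job.
>>>>>>>>>>>>> spark = (SparkSession.builder
>>>>>>>>>>>>>          .appName("commit-v2-example")
>>>>>>>>>>>>>          .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
>>>>>>>>>>>>>          .getOrCreate())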
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are there any suggestions on the pros and cons of using this
>>>>>>>>>>>>> version? Or is there any ongoing Spark feature development to
>>>>>>>>>>>>> address this issue?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> With Best Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dipayan Dev
>>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> With Best Regards,
>>>>>>>>>
>>>>>>>>> Dipayan Dev
>>>>>>>>> Author of *Deep Learning with Hadoop
>>>>>>>>> <https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
>>>>>>>>> M.Tech (AI), IISc, Bangalore
>>>>>>>>>
>>>>>>>> --
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> With Best Regards,
>>>>>>>
>>>>>>> Dipayan Dev
>>>>>>> Author of *Deep Learning with Hadoop
>>>>>>> <https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
>>>>>>> M.Tech (AI), IISc, Bangalore
>>>>>>>
>>>>>> --
>>>>
>>>>
>>>>
>>>> With Best Regards,
>>>>
>>>> Dipayan Dev
>>>> Author of *Deep Learning with Hadoop
>>>> <https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
>>>> M.Tech (AI), IISc, Bangalore
>>>>
>>> --
>>>
>>>
>>>
>>> With Best Regards,
>>>
>>> Dipayan Dev
>>> Author of *Deep Learning with Hadoop
>>> <https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
>>> M.Tech (AI), IISc, Bangalore
>>>
>>
