Spark relies heavily on Hadoop for writing out files. You could try setting the Hadoop 
property mapreduce.output.basename on the SparkContext's hadoopConfiguration:

https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopConfiguration--
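
For example, a minimal sketch (the property name is a standard Hadoop one, but whether 
the DataFrame Parquet writer actually honors it depends on the output committer in use, 
so verify on a test directory first; df and someDirectory are the DataFrame and path 
from the thread below):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Ask Hadoop's output format to use "jobA" instead of the default "part"
// as the base name of the files it writes out.
spark.sparkContext.hadoopConfiguration.set("mapreduce.output.basename", "jobA")

df.write.mode("overwrite").format("parquet").save(someDirectory)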


> Am 18.07.2021 um 01:15 schrieb Eric Beabes <mailinglist...@gmail.com>:
> 
> 
> Mich - You're suggesting changing the "Path". The problem is that we have an 
> EXTERNAL table created on top of this path, so the "Path" CANNOT change. If it 
> could, this problem would be easy to solve. My question is about changing the 
> "Filename".
> 
> As Ayan pointed out, Spark doesn't seem to allow "prefixes" for the filenames!
> 
>> On Sat, Jul 17, 2021 at 1:58 PM Mich Talebzadeh <mich.talebza...@gmail.com> 
>> wrote:
>> Using this
>> 
>> df.write.mode("overwrite").format("parquet").saveAsTable("test.ABCD")
>> 
>> That will create a Parquet table in the database test, whose data is essentially 
>> stored under the Hive warehouse in the form
>> 
>> /user/hive/warehouse/test.db/abcd/000000_0
>> 
>> 
>> 
>> 
>>> On Sat, 17 Jul 2021 at 20:45, Eric Beabes <mailinglist...@gmail.com> wrote:
>>> I am not sure if you've understood the question. Here's how we're saving 
>>> the DataFrame:
>>> 
>>> df
>>>   .coalesce(numFiles)
>>>   .write
>>>   .partitionBy(partitionDate)
>>>   .mode("overwrite")
>>>   .format("parquet")
>>>   .save(someDirectory)
>>> 
>>> Now where would I add a 'prefix' in this one?
>>> 
>>>> On Sat, Jul 17, 2021 at 10:54 AM Mich Talebzadeh 
>>>> <mich.talebza...@gmail.com> wrote:
>>>> Try it and see if it works:
>>>> 
>>>> fullyQualifiedTableName = appName+'_'+tableName
>>>> 
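>>>> For example, a rough sketch (appName and tableName being whatever variables hold 
>>>> the job name and base table name, and test a hypothetical database):
>>>> 
>>>> val fullyQualifiedTableName = appName + "_" + tableName
>>>> 
>>>> // each job writes to its own table, so it is obvious which job produced which files
>>>> df.write.mode("overwrite").format("parquet").saveAsTable(s"test.$fullyQualifiedTableName")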
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Sat, 17 Jul 2021 at 18:02, Eric Beabes <mailinglist...@gmail.com> 
>>>>> wrote:
>>>>> I don't think Spark allows adding a 'prefix' to the file name, does it? 
>>>>> If it does, please tell me how. Thanks.
>>>>> 
>>>>>> On Sat, Jul 17, 2021 at 9:47 AM Mich Talebzadeh 
>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>> Jobs have names in Spark. You could prefix the job name to the file name when 
>>>>>> writing to the directory, I guess:
>>>>>> 
>>>>>>  val sparkConf = new SparkConf()
>>>>>>    .setAppName(sparkAppName)
>>>>>>  
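>>>>>> The name is then available at run time, e.g. (a sketch, assuming a SparkSession 
>>>>>> called spark):
>>>>>> 
>>>>>> // the application name set above, which could serve as the per-job prefix
>>>>>> val prefix = spark.sparkContext.appName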
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Sat, 17 Jul 2021 at 17:40, Eric Beabes <mailinglist...@gmail.com> 
>>>>>>> wrote:
>>>>>>> The reason we have two jobs writing to the same directory is that the data is 
>>>>>>> partitioned by 'day' (yyyymmdd) but the job runs hourly. Maybe the only way 
>>>>>>> around this is to create an hourly partition (/yyyymmdd/hh). Is that the only 
>>>>>>> way to solve this?
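>>>>>>> 
>>>>>>> Something like the following is what I mean by an hourly partition (just a 
>>>>>>> sketch; event_ts, yyyymmdd and hh are hypothetical column names):
>>>>>>> 
>>>>>>> import org.apache.spark.sql.functions.{col, date_format}
>>>>>>> 
>>>>>>> df
>>>>>>>   .withColumn("hh", date_format(col("event_ts"), "HH"))
>>>>>>>   .write
>>>>>>>   .partitionBy("yyyymmdd", "hh")
>>>>>>>   .mode("overwrite")
>>>>>>>   .format("parquet")
>>>>>>>   .save(someDirectory)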
>>>>>>> 
>>>>>>>> On Fri, Jul 16, 2021 at 5:45 PM ayan guha <guha.a...@gmail.com> wrote:
>>>>>>>> IMHO this is a bad idea, especially in failure scenarios. 
>>>>>>>> 
>>>>>>>> How about creating a separate subfolder for each job? 
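>>>>>>>> 
>>>>>>>> E.g. (a rough sketch, assuming each job can use its own application name as 
>>>>>>>> the subfolder):
>>>>>>>> 
>>>>>>>> val jobName = spark.sparkContext.appName
>>>>>>>> df.write.mode("overwrite").format("parquet").save(s"$someDirectory/$jobName")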
>>>>>>>> 
>>>>>>>>> On Sat, 17 Jul 2021 at 9:11 am, Eric Beabes 
>>>>>>>>> <mailinglist...@gmail.com> wrote:
>>>>>>>>> We have two (or more) jobs that write data into the same directory via 
>>>>>>>>> the DataFrame save method. We need to be able to figure out which job 
>>>>>>>>> wrote which file, perhaps by adding a 'prefix' to the file names. I was 
>>>>>>>>> wondering if there's any 'option' that allows us to do this. Googling 
>>>>>>>>> didn't turn up any solution, so I thought of asking the Spark 
>>>>>>>>> experts on this mailing list.
>>>>>>>>> 
>>>>>>>>> Thanks in advance.
>>>>>>>> -- 
>>>>>>>> Best Regards,
>>>>>>>> Ayan Guha
