[
https://issues.apache.org/jira/browse/HUDI-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397525#comment-17397525
]
Nikita Sheremet commented on HUDI-1363:
---------------------------------------
h3. [~vinoth]
Conversation history from emails:
{quote}
Hi Vinoth,
Thank you for your assistance.
Let me state our use case. I'll post it on Jira.
We have a ~100 TB data lake containing personal data.
We use Google BigQuery Omni to query the data, because not every tool can query
such a volume of data quickly.
We also need to remove personal data from time to time, so we decided to
use Hudi.
The problem is that after we converted the data from regular parquet to Hudi parquet,
BigQuery could no longer build its index and query the data efficiently.
We got an error:
BigQuery error in mk operation: Error while reading table:
data_detections_hudi, error message: Failed to add partition key y (type:
TYPE_INT64) to schema, because another column with the same name was already
present. This is not allowed. Full partition schema: [y:TYPE_INT64,
m:TYPE_INT64, d:TYPE_INT64].
So we need to get rid of the y, m, d columns somehow. It seems like the PR might solve
our problem.
Unfortunately, we can't use Hudi due to this issue.
We would really appreciate it if you could help us solve it.{quote}
{quote}Thanks, that helps a lot! Are you using
[https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs] ?
Are y, m, d already present in the input data frame, and are you doing a
`df.write.format("hudi").partitionBy("y", "m", "d")` ?
I think the spark parquet write explicitly removes this. We keep the columns so that you can,
say, repartition later and build different range indexes off the file.
But I understand BQ expects this. I am also thinking about workarounds, where
we could name the partition columns differently than y, m, d.
Let's continue on the JIRA if you don't mind.{quote}
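The BigQuery error quoted above can be reproduced outside BigQuery: hive-style paths carry y, m, d as partition keys, and a hive-partition-aware reader that also finds those names inside the parquet file schema sees duplicate columns. A minimal, Hudi-free Python sketch of that collision check (the function names here are illustrative, not BigQuery's implementation):

```python
# Hedged sketch: how a hive-partition-aware reader can hit a duplicate-column
# error when partition keys also exist inside the data files. Illustrative only.

def partition_keys_from_path(path):
    """Extract hive-style partition key names (k=v segments) from a file path."""
    return [seg.split("=", 1)[0] for seg in path.split("/") if "=" in seg]

def build_schema(file_columns, path):
    """Merge file columns with path partition keys, rejecting duplicates."""
    schema = list(file_columns)
    for key in partition_keys_from_path(path):
        if key in schema:
            raise ValueError(
                f"Failed to add partition key {key} to schema, "
                "because another column with the same name was already present."
            )
        schema.append(key)
    return schema

# Plain Spark parquet: partition columns are dropped from the files, so the
# merge succeeds and y, m, d come from the path alone.
print(build_schema(["id", "payload"], "mytable/y=2020/m=01/d=01/part-0.parquet"))

# Hudi keeps y, m, d inside the files as well, so the merge fails:
try:
    build_schema(["id", "payload", "y", "m", "d"],
                 "mytable/y=2020/m=01/d=01/part-0.parquet")
except ValueError as e:
    print(e)
```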
"I think the spark parquet write explicitly removes this."
If we are speaking only about the columns, then yes. But switching to the pure parquet
writer removes all Hudi features, such as indexing and fast deletion of records by id.
To tell the truth, it is very hard to build a workaround while keeping the Hudi
features. For example, it is impossible to read (via Hudi) from a full partition path
(like mytable/y=2020/m=01/d=01), remove the columns from the parquet files, and write
back to the same path directly: Hudi starts resolving the table in the Hive catalog
and fails without touching any data.
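The feature requested in this issue can be sketched in plain Python: use y, m, d to build the partition path (and, in general, the record key), then drop them from the payload that actually lands in the data file, so the file schema no longer collides with the hive partition keys. This is only an illustration of the requested behavior under assumed names, not Hudi's key-generator API:

```python
# Hedged sketch of the requested feature: derive the partition path from the
# y, m, d columns, then optionally drop them from the stored record.
# All names (write_record, drop_partition_columns) are hypothetical.

def write_record(record, partition_fields=("y", "m", "d"),
                 drop_partition_columns=True):
    """Return (partition_path, payload) for one record."""
    partition_path = "/".join(f"{f}={record[f]}" for f in partition_fields)
    payload = dict(record)
    if drop_partition_columns:
        for f in partition_fields:
            payload.pop(f, None)  # the values survive in the path, not the file
    return partition_path, payload

rec = {"id": 1, "y": 2020, "m": 1, "d": 1, "name": "..."}
print(write_record(rec))
```

With `drop_partition_columns=True` the stored payload no longer contains y, m, d, so a hive-partition-aware reader such as BigQuery can add them back from the path without a duplicate-column conflict; with it disabled you get today's Hudi behavior, where the columns exist in both places.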
> Provide Option to drop columns after they are used to generate partition or
> record keys
> ---------------------------------------------------------------------------------------
>
> Key: HUDI-1363
> URL: https://issues.apache.org/jira/browse/HUDI-1363
> Project: Apache Hudi
> Issue Type: New Feature
> Components: Writer Core
> Reporter: Balaji Varadarajan
> Assignee: liwei
> Priority: Major
> Fix For: 0.9.0
>
>
> Context: https://github.com/apache/hudi/issues/2213
--
This message was sent by Atlassian Jira
(v8.3.4#803005)