[
https://issues.apache.org/jira/browse/HUDI-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397525#comment-17397525
]
Nikita Sheremet commented on HUDI-1363:
---------------------------------------
h3. [~vinoth]
Conversation history from emails:
{quote}
Hi Vinoth,
Thank you for your assistance.
Let me state our use case. I'll post it on Jira.
We have a ~100 TB data lake containing personal data.
We use Google BigQuery Omni to query the data, because not every tool can query
such a volume of data quickly.
We also need to remove personal data from time to time, so we decided to
use Hudi.
The problem is that after we converted the data from regular parquet to Hudi parquet,
BigQuery could no longer build its index and query the data efficiently.
We got an error:
BigQuery error in mk operation: Error while reading table:
data_detections_hudi, error message: Failed to add partition key y (type:
TYPE_INT64) to schema, because another column with the same name was already
present. This is not allowed. Full partition schema: [y:TYPE_INT64,
m:TYPE_INT64, d:TYPE_INT64].
So we need to get rid of the y, m, d columns somehow. It seems like the PR might solve
our problem.
Unfortunately, we can't use Hudi due to this issue.
We would really appreciate it if you could help us solve it.{quote}
{quote}Thanks, that helps a lot! Are you using
[https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs] ?
Are y, m, d already present in the input data frame, and are you doing a
`df.write.format("hudi").partitionBy("y", "m", "d")` ?
I think the spark parquet write explicitly removes this. We keep the columns so that you can,
say, repartition later and build different range indexes off the file.
But I understand BQ expects this. I am also thinking about workarounds, where
we could name the partition columns differently than y, m, d.
Let's continue on the JIRA if you don't mind.{quote}
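The BigQuery error quoted above can be reproduced outside BigQuery: hive-style paths carry y, m, d as partition keys, and a hive-partition-aware reader that also finds those names inside the parquet file schema sees duplicate columns. A minimal, Hudi-free Python sketch of that collision check (the function names here are illustrative, not BigQuery's implementation):

```python
# Hedged sketch: how a hive-partition-aware reader can hit a duplicate-column
# error when partition keys also exist inside the data files. Illustrative only.

def partition_keys_from_path(path):
    """Extract hive-style partition key names (k=v segments) from a file path."""
    return [seg.split("=", 1)[0] for seg in path.split("/") if "=" in seg]

def build_schema(file_columns, path):
    """Merge file columns with path partition keys, rejecting duplicates."""
    schema = list(file_columns)
    for key in partition_keys_from_path(path):
        if key in schema:
            raise ValueError(
                f"Failed to add partition key {key} to schema, "
                "because another column with the same name was already present."
            )
        schema.append(key)
    return schema

# Plain Spark parquet: partition columns are dropped from the files, so the
# merge succeeds and y, m, d come from the path alone.
print(build_schema(["id", "payload"], "mytable/y=2020/m=01/d=01/part-0.parquet"))

# Hudi keeps y, m, d inside the files as well, so the merge fails:
try:
    build_schema(["id", "payload", "y", "m", "d"],
                 "mytable/y=2020/m=01/d=01/part-0.parquet")
except ValueError as e:
    print(e)
```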
"I think the spark parquet write explicitly removes this."
If we are speaking only about the columns, then yes. But switching to the pure parquet
writer removes all Hudi features, such as indexing and fast deletion of records by id.
To tell the truth, it is very hard to build a workaround while keeping the Hudi
features. For example, it is impossible to read (via Hudi) from a full partition path
(like mytable/y=2020/m=01/d=01), remove the columns from the parquet files, and write
back to the same path directly: Hudi starts resolving the table in the Hive catalog
and fails without touching any data.
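The feature requested in this issue can be sketched in plain Python: use y, m, d to build the partition path (and, in general, the record key), then drop them from the payload that actually lands in the data file, so the file schema no longer collides with the hive partition keys. This is only an illustration of the requested behavior under assumed names, not Hudi's key-generator API:

```python
# Hedged sketch of the requested feature: derive the partition path from the
# y, m, d columns, then optionally drop them from the stored record.
# All names (write_record, drop_partition_columns) are hypothetical.

def write_record(record, partition_fields=("y", "m", "d"),
                 drop_partition_columns=True):
    """Return (partition_path, payload) for one record."""
    partition_path = "/".join(f"{f}={record[f]}" for f in partition_fields)
    payload = dict(record)
    if drop_partition_columns:
        for f in partition_fields:
            payload.pop(f, None)  # the values survive in the path, not the file
    return partition_path, payload

rec = {"id": 1, "y": 2020, "m": 1, "d": 1, "name": "..."}
print(write_record(rec))
```

With `drop_partition_columns=True` the stored payload no longer contains y, m, d, so a hive-partition-aware reader such as BigQuery can add them back from the path without a duplicate-column conflict; with it disabled you get today's Hudi behavior, where the columns exist in both places.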
> Provide Option to drop columns after they are used to generate partition or
> record keys
> ---------------------------------------------------------------------------------------
>
> Key: HUDI-1363
> URL: https://issues.apache.org/jira/browse/HUDI-1363
> Project: Apache Hudi
> Issue Type: New Feature
> Components: Writer Core
> Reporter: Balaji Varadarajan
> Assignee: liwei
> Priority: Major
> Fix For: 0.9.0
>
>
> Context: https://github.com/apache/hudi/issues/2213
--
This message was sent by Atlassian Jira
(v8.3.4#803005)