voonhous commented on issue #17744: URL: https://github.com/apache/hudi/issues/17744#issuecomment-3822180385
How to obtain the raw variant and metadata bytes:

Run the PySpark scripts from the following comments. This will be STEP 1:

https://github.com/apache/hudi/issues/17744#issuecomment-3822090390
https://github.com/apache/hudi/issues/17744#issuecomment-3710064394

Install parquet-tools in your virtualenv:

```shell
pip3 install parquet-tools
```

Then modify `venv/lib/python-version/site-packages/parquet_tools/commands/show.py`. Replace everything from the `_execute` function onward with the code below:

```python
def _execute(df: pd.DataFrame, format: str, head: int, columns: list) -> None:
    # head
    df_head: pd.DataFrame = df.head(head) if head > 0 else df
    # select columns
    df_select: pd.DataFrame = df_head[columns] if len(columns) else df_head
    # convert every bytes value (including nested ones) to a hex string
    df_hex = df_select.applymap(to_hex_recursive)
    # print the hex-encoded view first, then the original view
    print(tabulate(df_hex, df_select.columns, tablefmt=format, showindex=False))
    print(tabulate(df_select, df_select.columns, tablefmt=format, showindex=False))


def to_hex_recursive(val):
    # walk dicts, lists and tuples; leave non-bytes values unchanged
    if val is None:
        return None
    if isinstance(val, (bytes, bytearray)):
        return bytes_to_hex_commas(val)
    if isinstance(val, dict):
        return {k: to_hex_recursive(v) for k, v in val.items()}
    if isinstance(val, (list, tuple)):
        return type(val)(to_hex_recursive(v) for v in val)
    return val


def bytes_to_hex_commas(b: bytes) -> str:
    # e.g. b"\x01\xff" -> "0x01,0xff"
    return ",".join(f"0x{x:02x}" for x in b)
```

Then run:

`parquet-tools show output-to-parquet-STEP 1`
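For a quick check without patching the installed package, the same hex conversion can be sketched as a standalone script. This is only a minimal sketch and not part of parquet-tools: `dump_parquet_hex` is a hypothetical helper name, the sample `value`/`metadata` bytes are made up for illustration, and the Parquet path assumes pandas with a Parquet engine (such as pyarrow) is installed and that struct columns are returned as Python dicts.

```python
def bytes_to_hex_commas(b: bytes) -> str:
    """Render bytes as comma-separated hex literals, e.g. 0x01,0xff."""
    return ",".join(f"0x{x:02x}" for x in b)


def to_hex_recursive(val):
    """Convert bytes to hex strings, descending into dicts, lists and tuples."""
    if isinstance(val, (bytes, bytearray)):
        return bytes_to_hex_commas(val)
    if isinstance(val, dict):
        return {k: to_hex_recursive(v) for k, v in val.items()}
    if isinstance(val, (list, tuple)):
        return type(val)(to_hex_recursive(v) for v in val)
    return val  # None, numbers, strings, etc. pass through unchanged


def dump_parquet_hex(path: str) -> None:
    """Hypothetical helper: print a Parquet file with binary cells as hex."""
    # Imported here so the helpers above work without pandas installed.
    import pandas as pd

    df = pd.read_parquet(path)
    # Series.map applies the function element by element in each column.
    df_hex = df.apply(lambda col: col.map(to_hex_recursive))
    print(df_hex.to_string(index=False))


if __name__ == "__main__":
    # Made-up bytes shaped like a variant struct, for illustration only.
    sample = {"value": b"\x0c\x01", "metadata": b"\x01\x00\x00"}
    print(to_hex_recursive(sample))
```

To inspect a real file, call `dump_parquet_hex` with the path of the Parquet output written in STEP 1.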
