kolfild26 commented on issue #44513: URL: https://github.com/apache/arrow/issues/44513#issuecomment-2544141432
@zanmato1984 Stacktrace:

```bash
Dec 16 01:07:44 kernel: python[37938]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3f10b09018 error 4 in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: python[37971]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f002b0018 error 4
Dec 16 01:07:44 kernel: python[37961]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3f052d0018 error 4 in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: python[37957]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f072d8018 error 4
Dec 16 01:07:44 kernel: python[37940]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3f0fb07018 error 4
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: python[37974]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3d18f6d018 error 4 in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: python[37966]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f02abf018 error 4
Dec 16 01:07:44 kernel: python[37951]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f0a2ec018 error 4
Dec 16 01:07:44 kernel: python[37973]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3efb7fe018 error 4
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: python[37953]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f092e6018 error 4
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
Dec 16 01:07:44 abrt-hook-ccpp: Process 35963 (python3.10) of user 1000 killed by SIGSEGV - dumping core
```

Here are the tables' statistics:

<details>
<summary>Script to get stats</summary>

```python
import pyarrow as pa
import pyarrow.compute as pc
import pandas as pd
import pyarrow.types as patypes


def get_column_distributions(table):
    distributions = {}
    total_rows = table.num_rows

    for column in table.schema.names:
        col_data = table[column]

        # Null count and share of nulls in the column
        null_count = pc.sum(pc.is_null(col_data)).as_py()
        null_percentage = (null_count / total_rows) * 100 if total_rows > 0 else 0

        # Compute the cardinality (unique count / total count)
        unique_count = pc.count_distinct(col_data.filter(pc.is_valid(col_data))).as_py()
        cardinality_percentage = round((unique_count / total_rows) * 100, 3) if total_rows > 0 else 0

        if patypes.is_integer(col_data.type) or patypes.is_floating(col_data.type):
            # Numeric columns: non-null count plus min/max
            stats = {
                "count": pc.count(col_data).as_py(),
                "nulls": null_count,
                "null_percentage": null_percentage,
                "cardinality_percentage": cardinality_percentage,
                "min": pc.min(col_data).as_py(),
                "max": pc.max(col_data).as_py(),
            }
        elif patypes.is_string(col_data.type) or patypes.is_binary(col_data.type):
            # String/binary columns: per-value counts over the non-null values
            value_counts = pc.value_counts(col_data.filter(pc.is_valid(col_data)))
            stats = {
                "nulls": null_count,
                "null_percentage": null_percentage,
                "cardinality_percentage": cardinality_percentage,
                "value_counts": value_counts.to_pandas().to_dict("records"),
            }
        else:
            stats = {
                "nulls": null_count,
                "null_percentage": null_percentage,
                "cardinality_percentage": cardinality_percentage,
                "message": f"Statistics not supported for type: {col_data.type}",
            }

        distributions[column] = stats

    return distributions
```

</details>

<details>
<summary>small</summary>

</details>

<details>
<summary>large</summary>

</details>

Would it be easier if I attached the tables here?
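
In case it helps with reading the kernel lines in the stacktrace, here is a minimal sketch of how the faulting instruction's offset inside the `libarrow.so.1801` mapping could be derived from one of them. The regular expression and the helper name `fault_offset` are my own and only illustrative. The result is relative to the start of the mapping the kernel reports, which I am assuming (but have not verified) is a useful starting point for symbolizing the address.

```python
import re

# Matches the fields of a Linux kernel segfault line like the ones quoted above.
# Lines where the "in <lib>[base+size]" part was split off will not match.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<error>\d+) in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (library, ip - mapping base) for a kernel segfault line, or None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return m.group("lib"), ip - base


line = (
    "kernel: python[37938]: segfault at 7f3004626050 ip 00007f3fc25441cd "
    "sp 00007f3f10b09018 error 4 in libarrow.so.1801[7f3fc1670000+2269000]"
)
lib, offset = fault_offset(line)
print(lib, hex(offset))  # libarrow.so.1801 0xed41cd
```

All of the lines in the log share the same faulting address and only two distinct `ip` values (`...41cd` and `...41db`), so the threads appear to crash at two instructions that are 14 bytes apart.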