kolfild26 commented on issue #44513:
URL: https://github.com/apache/arrow/issues/44513#issuecomment-2544141432

   @zanmato1984 
   Stacktrace:
   ```bash
   Dec 16 01:07:44 kernel: python[37938]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3f10b09018 error 4 in libarrow.so.1801[7f3fc1670000+2269000]
   Dec 16 01:07:44 kernel: python[37971]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f002b0018 error 4
   Dec 16 01:07:44 kernel: python[37961]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3f052d0018 error 4 in libarrow.so.1801[7f3fc1670000+2269000]
   Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
   Dec 16 01:07:44 kernel: python[37957]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f072d8018 error 4
   Dec 16 01:07:44 kernel: python[37940]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3f0fb07018 error 4
   Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
   Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
   Dec 16 01:07:44 kernel: python[37974]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3d18f6d018 error 4 in libarrow.so.1801[7f3fc1670000+2269000]
   Dec 16 01:07:44 kernel: python[37966]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f02abf018 error 4
   Dec 16 01:07:44 kernel: python[37951]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f0a2ec018 error 4
   Dec 16 01:07:44 kernel: python[37973]: segfault at 7f3004626050 ip 00007f3fc25441cd sp 00007f3efb7fe018 error 4
   Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
   Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
   Dec 16 01:07:44 kernel: python[37953]: segfault at 7f3004626050 ip 00007f3fc25441db sp 00007f3f092e6018 error 4
   Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
   Dec 16 01:07:44 kernel: in libarrow.so.1801[7f3fc1670000+2269000]
   Dec 16 01:07:44 abrt-hook-ccpp: Process 35963 (python3.10) of user 1000 killed by SIGSEGV - dumping core
   ```
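
For anyone reading the log above, here is a minimal sketch of how the fault addresses could be turned into offsets inside `libarrow.so.1801`. It assumes the `[7f3fc1670000+2269000]` pair printed by the kernel is the start and size of the mapping that contains the instruction pointer, and that the `error 4` value follows the x86 page-fault error code layout. The helper names `fault_offsets` and `decode_error_code` are hypothetical, written only for this illustration.

```python
import re

# Two of the kernel log lines quoted above, copied verbatim.
LOG_LINES = [
    "python[37938]: segfault at 7f3004626050 ip 00007f3fc25441cd "
    "sp 00007f3f10b09018 error 4 in libarrow.so.1801[7f3fc1670000+2269000]",
    "python[37971]: segfault at 7f3004626050 ip 00007f3fc25441db "
    "sp 00007f3f002b0018 error 4",
]

# Mapping start and size taken from "[7f3fc1670000+2269000]" in the log.
LIB_BASE = 0x7F3FC1670000
LIB_SIZE = 0x2269000


def fault_offsets(lines, base, size):
    """Return the sorted, distinct offsets of faulting instructions within the mapping."""
    offsets = set()
    for line in lines:
        match = re.search(r"ip ([0-9a-f]+)", line)
        if match is None:
            continue
        ip = int(match.group(1), 16)
        # Keep only instruction pointers that fall inside the mapping.
        if base <= ip < base + size:
            offsets.add(ip - base)
    return sorted(offsets)


def decode_error_code(code):
    """Decode the low three bits of an x86 page-fault error code (assumed layout)."""
    return {
        "protection_violation": bool(code & 1),  # 0 means the page was not present
        "write": bool(code & 2),                 # 0 means the access was a read
        "user_mode": bool(code & 4),             # 1 means the fault came from user space
    }


offsets = [hex(o) for o in fault_offsets(LOG_LINES, LIB_BASE, LIB_SIZE)]
print(offsets)
print(decode_error_code(4))
```

Under those assumptions, every thread faults at one of just two nearby instructions while reading the same unmapped address. If a build with debug symbols is available, the offsets might be resolvable with `addr2line -e libarrow.so.1801 <offset>`, though the mapping start is not guaranteed to equal the library's load address.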
   
   Here are the tables' statistics:
   
   <details>
   
   <summary>Script to get stats</summary>
   
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   import pandas as pd
   import pyarrow.types as patypes
   
   def get_column_distributions(table):
       distributions = {}
       total_rows = table.num_rows
   
       for column in table.schema.names:
           col_data = table[column]
           null_count = pc.sum(pc.is_null(col_data)).as_py()
           null_percentage = (null_count / total_rows) * 100 if total_rows > 0 else 0
           
           # Compute the cardinality (unique count / total count)
           unique_count = pc.count_distinct(col_data.filter(pc.is_valid(col_data))).as_py()
           cardinality_percentage = round((unique_count / total_rows) * 100, 3) if total_rows > 0 else 0
           
           if patypes.is_integer(col_data.type) or patypes.is_floating(col_data.type):
               stats = {
                   "count": pc.count(col_data).as_py(),
                   "nulls": null_count,
                   "null_percentage": null_percentage,
                   "cardinality_percentage": cardinality_percentage,
                   "min": pc.min(col_data).as_py(),
                   "max": pc.max(col_data).as_py(),
               }
           elif patypes.is_string(col_data.type) or patypes.is_binary(col_data.type):
               value_counts = pc.value_counts(col_data.filter(pc.is_valid(col_data)))
               stats = {
                   "nulls": null_count,
                   "null_percentage": null_percentage,
                   "cardinality_percentage": cardinality_percentage,
                   "value_counts": value_counts.to_pandas().to_dict("records"),
               }
           else:
               stats = {
                   "nulls": null_count,
                   "null_percentage": null_percentage,
                   "cardinality_percentage": cardinality_percentage,
                   "message": f"Statistics not supported for type: {col_data.type}"
               }
   
           distributions[column] = stats
   
       return distributions
   ```
   </details>
   
   <details>
   
   <summary>small</summary>
   
   
![small](https://github.com/user-attachments/assets/3f8922e9-0048-4edd-9153-a95e1e081054)
   
   </details>
   <details>
   
   <summary>large</summary>
   
   
![large](https://github.com/user-attachments/assets/984b642a-b890-4dfd-97b8-229c7acb860e)
   
   </details>
   
   Would it be easier if I attached the tables here?
   
   

