[ https://issues.apache.org/jira/browse/ARROW-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342514#comment-17342514 ]

David Li commented on ARROW-11903:
----------------------------------

[~bioinfornatics] did 3.0.0 solve your problem? There's also 4.0.0, which 
was released recently.

> Stored data to parquet do not fit values before the storing
> -----------------------------------------------------------
>
>                 Key: ARROW-11903
>                 URL: https://issues.apache.org/jira/browse/ARROW-11903
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Archery
>    Affects Versions: 2.0.0
>            Reporter: Jonathan mercier
>            Priority: Major
>
> Dear all,
>  
> I am seeing strange behavior: data do not keep the same values once 
> stored to Parquet.
>  
> the schema is:
>  
> {code:python}
> from pyarrow import field, int8, int64, list_, schema, string, struct
> 
> variations = struct((field('start', int64(), nullable=False),
>                      field('stop', int64(), nullable=False),
>                      field('reference', string(), nullable=False),
>                      field('alternative', string(), nullable=False),
>                      field('category', int8(), nullable=False)))
> variations_field = field('variations', list_(variations))
> # sample_field is not shown in the report; presumably something like:
> sample_field = field('sample', int64(), nullable=False)
> metadata = {b'pandas': b'{"index_columns": ["sample"], '
>                        b'"column_indexes": [{"name": null, "field_name": "sample", '
>                        b'"pandas_type": "int64", "numpy_type": "int64"}], '
>                        b'"columns": ['
>                        b'{"name": "variations", "field_name": "variations", '
>                        b'"pandas_type": "list[object]", "numpy_type": "object", '
>                        b'"metadata": null}, '
>                        b'{"name": "sample", "field_name": "sample", '
>                        b'"pandas_type": "int64", "numpy_type": "int64", '
>                        b'"metadata": null}], '
>                        b'"pandas_version": "1.2.0"}'}
> sample_to_variations_schema = schema((sample_field, variations_field),
>                                      metadata=metadata)
> {code}
>  
> To store the data I do:
> {code:python}
> from os import makedirs, path
> 
> from pyarrow import Table
> from pyarrow.parquet import ParquetWriter
> 
> table = Table.from_arrays([samples, variations_by_sample],
>                           schema=sample_to_variations_schema)
> dataset_dir = path.join(outdir, f'contig={contig}')
> makedirs(dataset_dir, exist_ok=True)
> with ParquetWriter(where=path.join(dataset_dir, 'variant_to_samples'),
>                    version='2.0', schema=table.schema,
>                    compression='SNAPPY') as pw:
>     pw.write_table(table)
> {code}
> I set a breakpoint just after table is assigned, in order to check the 
> values in memory.
> Example for row n°210027:
> {code:python}
> >>> samples[210027]
> 831028
> >>> variations_by_sample[210027]
> [(241, 241, 'C', 'T', 0), (445, 445, 'T', 'C', 0), (3037, 3037, 'C', 'T', 0), 
> (6286, 6286, 'C', 'T', 0), (11024, 11024, 'A', 'G', 0), (14408, 14408, 'C', 
> 'T', 0), (21255, 21255, 'G', 'C', 0), (22227, 22227, 'C', 'T', 0), (23403, 
> 23403, 'A', 'G', 0), (24140, 24140, 'G', 'A', 0), (25496, 25496, 'T', 'C', 
> 0), (26801, 26801, 'C', 'G', 0), (27840, 27840, 'T', 'C', 0), (27944, 27944, 
> 'C', 'T', 0), (27948, 27948, 'G', 'T', 0), (28932, 28932, 'C', 'T', 0), 
> (29645, 29645, 'G', 'T', 0)]
> {code}
> Now the application ends successfully and the data are stored in a 
> Parquet dataset.
> So I load those data back and check their consistency.
> {code:python}
> $ ipython
> In [1]: from pyarrow.parquet import read_table
>    ...: sample_to_variants = read_table('sample_to_variants_db')
> In [2]: row_num = 0
>    ...: an_id = 0
>    ...: while an_id != 831028:
>    ...:     an_id = sample_to_variants.column(0)[row_num].as_py()
>    ...:     row_num += 1
>    ...: 
> In [3]: sample_to_variants.column(0)[row_num-1].as_py()
> Out[3]: 831028
> In [4]: sample_to_variants.column(1)[row_num-1].as_py()
> Out[4]: 
> [{'start': 241,
>   'stop': 241,
>   'reference': 'C',
>   'alternative': 'T',
>   'category': 0},
>  {'start': 445,
>   'stop': 445,
>   'reference': 'G',
>   'alternative': 'T',
>   'category': 0},
>  {'start': 3037,
>   'stop': 3037,
>   'reference': 'G',
>   'alternative': 'T',
>   'category': 0},
>  {'start': 6286,
>   'stop': 6286,
>   'reference': 'C',
>   'alternative': 'T',
>   'category': 0},
>  {'start': 11024,
>   'stop': 11024,
>   'reference': 'C',
>   'alternative': 'T',
>   'category': 0},
>  {'start': 14408,
>   'stop': 14408,
>   'reference': 'C',
>   'alternative': 'T',
>   'category': 0},
>  {'start': 21255,
>   'stop': 21255,
>   'reference': 'G',
>   'alternative': 'T',
>   'category': 0},
>  {'start': 22227,
>   'stop': 22227,
>   'reference': 'G',
>   'alternative': 'A',
>   'category': 0},
>  {'start': 23403,
>   'stop': 23403,
>   'reference': 'C',
>   'alternative': 'T',
>   'category': 0},
>  {'start': 24140,
>   'stop': 24140,
>   'reference': 'C',
>   'alternative': 'T',
>   'category': 0},
>  {'start': 25496,
>   'stop': 25496,
>   'reference': 'A',
>   'alternative': 'G',
>   'category': 0},
>  {'start': 26801,
>   'stop': 26801,
>   'reference': 'G',
>   'alternative': 'T',
>   'category': 0},
>  {'start': 27840,
>   'stop': 27840,
>   'reference': 'C',
>   'alternative': 'T',
>   'category': 0},
>  {'start': 27944,
>   'stop': 27944,
>   'reference': 'T',
>   'alternative': 'C',
>   'category': 0},
>  {'start': 27948,
>   'stop': 27948,
>   'reference': 'G',
>   'alternative': 'A',
>   'category': 0},
>  {'start': 28932,
>   'stop': 28932,
>   'reference': 'C',
>   'alternative': 'T',
>   'category': 0},
>  {'start': 29645,
>   'stop': 29645,
>   'reference': 'G',
>   'alternative': 'A',
>   'category': 0}]
> {code}
> We can see that column 1 (0-based) does not hold the same values as it 
> did before being written to Parquet.
> For example, the Parquet dataset contains this value:
> {code:python}
>  {'start': 24140,
>   'stop': 24140,
>   'reference': 'C',
>   'alternative': 'T',
>   'category': 0},
> {code}
> while in memory, before being stored, it was:
> {code:python}
> (24140, 24140, 'G', 'A', 0)
> {code}
> I do not understand what mechanism leads to this inconsistency, so I am 
> not able to produce a minimal example case (sorry).
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
