Thomas Buhrmann created ARROW-7168:
--------------------------------------

Summary: pa.array() doesn't respect provided dictionary type with all NaNs
Key: ARROW-7168
URL: https://issues.apache.org/jira/browse/ARROW-7168
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 0.15.1
Reporter: Thomas Buhrmann

This might be related to ARROW-6548 and other issues dealing with all-NaN columns. When creating a dictionary array, the requested type is not respected when the data contains only NaNs, even if that type is fully specified:

{code:python}
import pandas as pd
import pyarrow as pa

# This may look a little artificial, but it easily occurs when processing
# categorical data in batches and a particular batch contains only NaNs.
ser = pd.Series([None, None]).astype('object').astype('category')
typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), ordered=False)
pa.array(ser, type=typ).type
{code}

results in

{noformat}
>> DictionaryType(dictionary<values=null, indices=int8, ordered=0>)
{noformat}

The value type comes back as null rather than the requested string. As a result, one cannot, for example, serialize batches of categoricals if all-NaN batches are possible, even when trying to enforce the same schema on every batch, because the schema is not respected.

I understand that inferring the type in this case would be difficult, but I'd imagine that a fully specified type should be respected here?

In the meantime, is there a workaround to manually create a dictionary array of the desired type containing only NaNs?