[ https://issues.apache.org/jira/browse/ARROW-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17609948#comment-17609948 ]
Joris Van den Bossche commented on ARROW-17834:
-----------------------------------------------

> One additional tricky thing here is what if the storage array also need
> additional arguments.

Hmm, such a case wouldn't be solved by this simple solution.

I was thinking that one possible solution for this would be to encode this dictionary in the actual extension type (e.g. so that you need to, or can, pass it to the type constructor, like {{LabelType(dictionary=...)}}), and then the cast could take care of that. However, in Arrow the dictionary is part of the data, not the type, so casting to an extension type (under the hood, casting to the storage type) won't actually do any checking of dictionary values. For such a use case, you would still have to manually create the storage array first (in this case, building the DictionaryArray from explicit indices and an explicit dictionary, to ensure a certain dictionary array is used) before converting to an extension array.

The only way I can think of to give the extension array author control over this would be to add a method like {{\_\_arrow_construct_storage_array\_\_}} to the extension type, so that we can call this instead of doing {{pa.array(data, type=ext_type.storage_type)}}. But I am not fully sure this is useful enough in general to warrant adding such a method (a more general mechanism that might be interesting to add is a way to register custom cast methods).

> [Python] Allow creating ExtensionArray through pa.array(..) constructor
> -----------------------------------------------------------------------
>
>                 Key: ARROW-17834
>                 URL: https://issues.apache.org/jira/browse/ARROW-17834
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> Currently, creating an ExtensionArray from a Python sequence (or numpy array,
> ..) requires the following:
> {code:python}
> import pyarrow as pa
> from pyarrow.tests.test_extension_type import IntegerType
> storage_array = pa.array([1, 2, 3])
> ext_arr = pa.ExtensionArray.from_storage(IntegerType(), storage_array)
> {code}
> While doing this directly in {{pa.array(..)}} doesn't work:
> {code:python}
> >>> pa.array([1, 2, 3], type=IntegerType())
> ArrowNotImplementedError: extension
> {code}
> I think it should be possible to basically do the ExtensionArray.from_storage
> under the hood in {{pa.array(..)}} when the specified type is an extension
> type?
> I think this should also enable converting from a pandas DataFrame (with a
> column with matching storage values) to a Table with a specified schema that
> includes an extension type. Like:
> {code}
> import pandas as pd
> df = pd.DataFrame({'a': [1, 2, 3]})
> pa.table(df, schema=pa.schema([('a', IntegerType())]))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)