I have used the custom metadata feature in the past. I used it to track (for example) which variables were independent and which were dependent. That information was then used as input for later tools that helped present the data.
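In case a concrete example helps, here is a minimal sketch of that kind of tagging. The column names and the "role" key are made up for illustration; they are not any standard convention:

```
import pyarrow as pa

# Field-level metadata: mark each variable's role.  Keys and values are
# stored as bytes; the "role" key here is just an example.
fields = [
    pa.field("temperature", pa.float64(), metadata={"role": "independent"}),
    pa.field("yield", pa.float64(), metadata={"role": "dependent"}),
]

# Schema-level metadata works the same way.
schema = pa.schema(fields, metadata={"source": "example-experiment"})

table = pa.table(
    {"temperature": [20.0, 25.0, 30.0], "yield": [0.61, 0.72, 0.69]},
    schema=schema,
)

# The metadata travels with the schema; note it comes back as bytes.
print(table.schema.field("temperature").metadata)  # {b'role': b'independent'}
```

A downstream tool can then read the "role" entry off each field to decide how to plot or summarize that column.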
> Is that how most people handle metadata, where they create the schema and
> field first and then call with_metadata, or do people construct schemas and
> fields with metadata already?

Either approach should work fine. It will probably depend on the source of
your data and the source of your metadata. When I used the feature I
generated the metadata at the same time I generated the data, so when I
created the schema I already had the metadata available. Later, I needed to
support a case where users tweaked the metadata manually. When such a request
arrived I had to load the table, replace the metadata (using with_metadata),
and then save the table back out (a rough sketch of that round trip is at the
end of this mail).

> Is that best/standard practice, or just legacy code from people who started
> out using Pandas? Should I just work in PyArrow exclusively?

PyArrow does not currently do everything that pandas does. Furthermore, there
are no plans (that I know of) to bring over many of the features in pandas,
for example statistical tests, histogram plotting, etc. If you are doing
general purpose data analytics it is highly likely you will want to use
pandas. Knowing exactly where to draw the line between the two can be tricky,
and different people on this mailing list will probably have different
opinions.

On Fri, Apr 23, 2021 at 2:50 AM Michael Lavina <michael.lav...@factset.com> wrote:
>
> Hello Team,
>
> The docs for custom schema and field metadata in PyArrow say TODO:
> https://arrow.apache.org/docs/python/data.html#custom-schema-and-field-metadata
> So I am wondering if someone has an example of adding custom metadata to
> PyArrow.
>
> I tried looking through the pyarrow GitHub repo at how it handles the pandas
> conversion, because I know that uses metadata, and I see it does this:
> ```
> metadata.update(pandas_metadata)
> schema = schema.with_metadata(metadata)
> ```
>
> Is that how most people handle metadata, where they create the schema and
> field first and then call with_metadata, or do people construct schemas and
> fields with metadata already?
>
> I guess a follow-up as well: most examples I see start with Pandas and
> convert to Arrow using `from_pandas`.
>
> Is that best/standard practice, or just legacy code from people who started
> out using Pandas? Should I just work in PyArrow exclusively? For context, I
> am more interested in ease of use and less so in performance, but I don't
> want to ignore memory/performance costs entirely.
>
> Thanks again everyone,
> Michael
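P.S. Here is roughly what that "load, replace metadata, save" round trip can
look like. This is only a sketch; it assumes Parquet files, and the file name
and metadata key are made up:

```
import pyarrow.parquet as pq

# Read the existing file back in (the path is just an example).
table = pq.read_table("experiment.parquet")

# Merge the user-edited entries with whatever is already on the schema so
# existing keys (e.g. the pandas conversion metadata) are not thrown away.
existing = table.schema.metadata or {}
updated = {**existing, b"notes": b"edited by a user"}

# Schema.with_metadata returns a new schema with the metadata replaced;
# Table.replace_schema_metadata applies the same replacement to the table.
table = table.replace_schema_metadata(updated)

# Write the table back out; the schema metadata is stored with the file.
pq.write_table(table, "experiment.parquet")
```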