Re: [Python] Custom Metadata in PyArrow

Wes McKinney Fri, 23 Apr 2021 13:16:05 -0700

On Fri, Apr 23, 2021 at 3:06 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> I have used the custom metadata feature in the past.  I used it to
> track (for example) which variables were independent variables and
> which were dependent variables.  This was used as input for later
> tools to help present the data.
>
> > Is that how most people handle metadata: create the schema and fields
> > first and then call with_metadata? Or do people construct schemas and
> > fields with the metadata already attached?
>
> Either approach should work fine.  It will probably depend on the
> source of your data and the source of your metadata.  When I used the
> feature I generated the metadata at the same time I generated the
> data.  So when I created the schema I already had the metadata
> available.
>
> Later, I needed to support a case where users tweaked the metadata
> manually.  When the request arrived I had to load the table, replace
> the metadata (using with_metadata), and then save the table back out.
>
> > Is that best/standard practice, or just legacy code from people using
> > pandas? Starting out, should I just work in PyArrow exclusively?
>
> pyarrow does not currently do everything that pandas does.
> Furthermore, there are no plans (that I know of) to bring over many of
> the features in pandas, such as statistical tests and histogram
> plotting.  If you are doing general purpose data analytics it is
> highly likely you will want to use pandas.  Knowing exactly where to
> draw the line with functionality can be tricky and different people on
> this ML will probably have different opinions.


Note that it's our goal for pyarrow to be a "backend" library
providing computation, IO, and memory management facilities, where
pandas is an integrated backend AND frontend library, if that makes
sense. So we want people to use pyarrow to build other user-facing
"frontend" projects.

> On Fri, Apr 23, 2021 at 2:50 AM Michael Lavina
> <michael.lav...@factset.com> wrote:
> >
> > Hello Team,
> >
> > The docs for Custom Metadata in PyArrow say TODO 
> > https://arrow.apache.org/docs/python/data.html#custom-schema-and-field-metadata
> > So I am wondering if someone has any example of adding some custom metadata 
> > to PyArrow.
> >
> > I tried looking through the pyarrow GitHub repo to see how they handle
> > pandas conversion, because I know that uses metadata, and I see they do this:
> > ```
> > metadata.update(pandas_metadata)
> > schema = schema.with_metadata(metadata)
> > ```
> >
> > Is that how most people handle metadata: create the schema and fields
> > first and then call with_metadata? Or do people construct schemas and
> > fields with the metadata already attached?
> >
> > I guess a follow up as well is that most examples I see use Pandas to start 
> > and convert to arrow using `from_pandas`.
> >
> > Is that best/standard practice, or just legacy code from people using
> > pandas? Starting out, should I just work in PyArrow exclusively? For
> > context, I am more interested in ease of use than in performance, but I
> > don’t want to ignore memory/performance cost entirely.
> >
> > Thanks again everyone,
> > Michael
