I have used the custom metadata feature in the past. I used it to track (for example) which variables were independent and which were dependent. That information was then used as input for later tools that helped present the data.
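In case a concrete example helps, here is a minimal sketch of that kind of tagging. The column names and the "role" key are made up for illustration; they are not any standard convention:

```
import pyarrow as pa

# Field-level metadata: mark each variable's role.  Keys and values are
# stored as bytes; the "role" key here is just an example.
fields = [
    pa.field("temperature", pa.float64(), metadata={"role": "independent"}),
    pa.field("yield", pa.float64(), metadata={"role": "dependent"}),
]

# Schema-level metadata works the same way.
schema = pa.schema(fields, metadata={"source": "example-experiment"})

table = pa.table(
    {"temperature": [20.0, 25.0, 30.0], "yield": [0.61, 0.72, 0.69]},
    schema=schema,
)

# The metadata travels with the schema; note it comes back as bytes.
print(table.schema.field("temperature").metadata)  # {b'role': b'independent'}
```

A downstream tool can then read the "role" entry off each field to decide how to plot or summarize that column.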
> Is that how most people handle metadata, where they create the schema and
> field first and then call with_metadata, or do people construct schemas and
> fields with metadata already?

Either approach should work fine. It will probably depend on the source of
your data and the source of your metadata. When I used the feature I
generated the metadata at the same time I generated the data, so when I
created the schema I already had the metadata available. Later, I needed to
support a case where users tweaked the metadata manually. When such a request
arrived I had to load the table, replace the metadata (using with_metadata),
and then save the table back out (a rough sketch of that round trip is at the
end of this mail).

> Is that best/standard practice, or just legacy code from people who started
> out using Pandas? Should I just work in PyArrow exclusively?

PyArrow does not currently do everything that pandas does. Furthermore, there
are no plans (that I know of) to bring over many of the features in pandas,
for example statistical tests, histogram plotting, etc. If you are doing
general purpose data analytics it is highly likely you will want to use
pandas. Knowing exactly where to draw the line between the two can be tricky,
and different people on this mailing list will probably have different
opinions.

On Fri, Apr 23, 2021 at 2:50 AM Michael Lavina <michael.lav...@factset.com> wrote:
>
> Hello Team,
>
> The docs for custom schema and field metadata in PyArrow say TODO:
> https://arrow.apache.org/docs/python/data.html#custom-schema-and-field-metadata
> So I am wondering if someone has an example of adding custom metadata to
> PyArrow.
>
> I tried looking through the pyarrow GitHub repo at how it handles the pandas
> conversion, because I know that uses metadata, and I see it does this:
> ```
> metadata.update(pandas_metadata)
> schema = schema.with_metadata(metadata)
> ```
>
> Is that how most people handle metadata, where they create the schema and
> field first and then call with_metadata, or do people construct schemas and
> fields with metadata already?
>
> I guess a follow-up as well: most examples I see start with Pandas and
> convert to Arrow using `from_pandas`.
>
> Is that best/standard practice, or just legacy code from people who started
> out using Pandas? Should I just work in PyArrow exclusively? For context, I
> am more interested in ease of use and less so in performance, but I don't
> want to ignore memory/performance costs entirely.
>
> Thanks again everyone,
> Michael
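P.S. Here is roughly what that "load, replace metadata, save" round trip can
look like. This is only a sketch; it assumes Parquet files, and the file name
and metadata key are made up:

```
import pyarrow.parquet as pq

# Read the existing file back in (the path is just an example).
table = pq.read_table("experiment.parquet")

# Merge the user-edited entries with whatever is already on the schema so
# existing keys (e.g. the pandas conversion metadata) are not thrown away.
existing = table.schema.metadata or {}
updated = {**existing, b"notes": b"edited by a user"}

# Schema.with_metadata returns a new schema with the metadata replaced;
# Table.replace_schema_metadata applies the same replacement to the table.
table = table.replace_schema_metadata(updated)

# Write the table back out; the schema metadata is stored with the file.
pq.write_table(table, "experiment.parquet")
```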