lupko opened a new issue, #1513:
URL: https://github.com/apache/arrow-adbc/issues/1513

   Hi all,
   
   I am currently evaluating/trying out ADBC (Python) in the following context:
   
   - Have a Flight RPC data service - let's call it 'connector' - that is responsible for handling arbitrary SQL queries to PostgreSQL and producing results as a stream of Arrow data
   - The result produced by 'connector' is then used by other services that do additional 'post-SQL' processing; it is further 'massaged' using different algorithms (typically aggregations or ML algorithms).
   - The PostgreSQL data stored in NUMERIC columns typically represents some 
measurable value - which is used in the different algorithms applied by the 
other services.
   - It is a hard constraint that the results produced by the 'connector' can 
be used as-is in the subsequent steps.
   
   In this context, the current ADBC PostgreSQL behavior where all NUMERIC columns are converted to strings is not ideal. I understand the motivation - ADBC does not want to do a lossy conversion by default, and string is the only safe bet at the moment; I need to find a way out though :)
   
   I was entertaining the idea that the 'connector' talking to PostgreSQL would perform the conversion on the fly, so that it produces data where the NUMERICs end up as doubles (e.g. pipe the record batches produced by ADBC through Acero to perform the str -> double conversion on the way out of the service).
   
   The tricky part here is that I'm unable to make this work in a simple, deterministic way with just the SQL 'in hand': the ADBC result gives no indication (or I was unable to find it? please let me know) that a string column in the result was in fact created from NUMERIC data.
   
   ---
   
   The ideal solution in my context would be for ADBC PostgreSQL to return 
NUMERIC columns as double. I read through the issues and found that you have 
been considering this as optional behavior.
   
   I dug into this a little bit and found in the PostgreSQL source that it does the numeric -> double conversion by calling `strtod` on the string representation (i.e. the algorithm that is already ported to ADBC's PostgreSQL driver). So if ADBC were to follow suit, it could reuse the existing numeric -> string conversion and then do `strtod` on top of it - it could be quite simple.
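   For illustration, the lossiness of that `strtod`-style parse is easy to see with a quick Python equivalent (`float()` rounds a decimal string to the nearest IEEE 754 double, like `strtod` does) - which is also why string is the safe default:

```python
from decimal import Decimal

# A NUMERIC value with more digits than a double can represent.
s = "0.1000000000000000055511151231257827"

# strtod-equivalent parse: rounds to the nearest IEEE 754 double.
d = float(s)

# The parse is lossy: the resulting double compares equal to plain 0.1 ...
assert d == 0.1
# ... and no longer matches the exact decimal value.
assert Decimal(s) != Decimal(d)
```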
   
   For reference, here is a working prototype: 
https://github.com/lupko/arrow-adbc/commit/6d52952b343dde5acc16e388b6592226bc4c3507
   
   - refactored the reader so that the numeric -> str code & friends from PostgreSQL live in a common base class
   - there is one class for Numeric to String; as before, the code populates the Arrow buffers directly
   - there is another class for Numeric to Double; the string representation is written into a temporary location, from which the code does `strtod` and populates the Arrow buffer with the `double` value
   
   What is missing in the prototype is the logic that decides whether to produce string or double - the code in the prototype always produces double.
   
   If you find this general direction acceptable, I could take it further and complete it. I just need a little extra guidance about making this optional. At what level should the setting be available? What should its name be? What is the best way to 'tunnel' the option down to the code that decides which Arrow type to use? (e.g. `PostgresType::SetSchema` seems like a good place - a simple conditional there would be enough; no need to add extra conditions into `adbcpq::MakeCopyFieldReader`)

