betodealmeida opened a new issue #8225: Pandas casting int64 to float64, misrepresenting value
URL: https://github.com/apache/incubator-superset/issues/8225

I have the following data being returned by Presto (single column, 6 rows):

```
[(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]
```

Due to the missing data (`None`), Pandas infers the type as `float64`, converting the value to a wrong id:

```python
>>> import pandas as pd
>>> column_names = ['organization_lyft_id']
>>> data = [(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]
>>> df = pd.DataFrame(list(data), columns=column_names).infer_objects()  # SupersetDataFrame
>>> print(df)
   organization_lyft_id
0                   NaN
1          1.239162e+18
2                   NaN
3                   NaN
4                   NaN
5                   NaN
>>> print(df.dtypes)
organization_lyft_id    float64
dtype: object
```

The number then shows up as `1239162456494753800` in SQL Lab.

Here's the Pandas documentation on this:

> ... pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. **Some integers cannot even be represented as floating point numbers.**

(emphasis mine)

Note that if the missing data is filtered out, the type is inferred as `int64` and the value shows up correctly in SQL Lab.

The solution is to pass a `dtypes` argument when creating the Pandas data frame, built from the cursor description. I'm working on a fix for this.
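
For illustration only, here is a minimal sketch of the idea of giving each column an explicit dtype instead of relying on inference. It is not the actual fix: `build_frame` and its `dtypes` mapping are hypothetical names, the mapping is written by hand rather than derived from a real cursor description, and the nullable `Int64` extension dtype (available in recent pandas versions) is just one possible choice for integer columns that contain missing values.

```python
import pandas as pd

column_names = ["organization_lyft_id"]
data = [(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]

# Default inference: None becomes NaN, which forces the column to float64
# and rounds the id to the nearest representable float.
df_float = pd.DataFrame(data, columns=column_names).infer_objects()
print(df_float.dtypes["organization_lyft_id"])
print(int(df_float["organization_lyft_id"].iloc[1]))


def build_frame(rows, names, dtypes):
    """Hypothetical helper: build every column with an explicit dtype.

    `dtypes` maps column name -> pandas dtype. In a real integration it
    would be derived from the DB-API cursor description (assumption, not
    shown here).
    """
    # Transpose the row tuples into one tuple of values per column.
    columns = list(zip(*rows)) if rows else [()] * len(names)
    return pd.DataFrame(
        {
            name: pd.array(list(values), dtype=dtypes[name])
            for name, values in zip(names, columns)
        }
    )


# "Int64" (capital I) is the nullable integer dtype: missing values are
# stored as <NA> instead of NaN, so the column never goes through float.
df_int = build_frame(data, column_names, {"organization_lyft_id": "Int64"})
print(df_int.dtypes["organization_lyft_id"])
print(df_int["organization_lyft_id"].iloc[1])
```

Each column is constructed on its own with `pd.array` so the values go straight from Python integers to the integer dtype, without an intermediate float conversion.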
