Hyukjin Kwon created SPARK-32098:
------------------------------------

             Summary: Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow
                 Key: SPARK-32098
                 URL: https://issues.apache.org/jira/browse/SPARK-32098
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.0.0, 2.4.6
            Reporter: Hyukjin Kwon

When you use floats as the index of a pandas DataFrame, the resulting Spark DataFrame contains duplicate rows:

{code}
>>> import pandas as pd
>>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
+---+
|  a|
+---+
|  1|
|  1|
|  2|
+---+
{code}

This is because direct slicing treats the slice bounds as index labels, not row positions, when the index contains floats:

{code}
>>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:]
     a
2.0  1
3.0  2
4.0  3
>>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:]
     a
4.0  3
>>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:]
   a
4  3
{code}
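
For illustration, here is a minimal sketch of splitting a pandas DataFrame into chunks by row position with iloc, which is the approach the summary proposes. The helper name positional_slices and its num_slices parameter are hypothetical and are not taken from the PySpark source; the sketch only shows that iloc yields each row exactly once regardless of the index dtype, while label-based slicing with loc on a float index selects by value.

```python
import pandas as pd


def positional_slices(pdf, num_slices):
    """Split pdf into at most num_slices chunks by row position.

    Hypothetical helper for illustration; not the PySpark implementation.
    """
    # Ceiling division so that every row lands in some chunk.
    step = max(-(-len(pdf) // num_slices), 1)
    # iloc always slices by position, whatever the index contains.
    return [pdf.iloc[start:start + step] for start in range(0, len(pdf), step)]


pdf = pd.DataFrame({'a': [1, 2, 3]}, index=[2., 3., 4.])

chunks = positional_slices(pdf, 2)
rows = [chunk['a'].tolist() for chunk in chunks]
print(rows)

# Label-based slicing selects every row whose label is >= 2,
# which is all three rows here, while positional slicing skips two rows.
print(len(pdf.loc[2:]), len(pdf.iloc[2:]))
```

With a float index, the positional chunks cover each row once, whereas a label-based slice starting at 2 returns the whole frame, which is how overlapping chunks and duplicate rows can arise.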