korbit-ai[bot] commented on code in PR #34763: URL: https://github.com/apache/superset/pull/34763#discussion_r2298968063
########## superset/commands/database/uploaders/csv_reader.py: ########## @@ -188,13 +189,32 @@ index_col = kwargs.get("index_col") if isinstance(index_col, str): result.index.name = index_col - return result - return pd.DataFrame() + df = result + else: + df = pd.read_csv( + filepath_or_buffer=file.stream, + **kwargs, + ) - return pd.read_csv( - filepath_or_buffer=file.stream, - **kwargs, - ) + if types: + for column, dtype in types.items(): + try: + df[column] = df[column].astype(dtype) Review Comment: ### Inefficient Column-by-Column Type Conversion <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? Type conversion is performed column by column, which is less efficient than converting all columns at once. ###### Why this matters Sequential column conversion creates multiple intermediate DataFrames and increases memory usage. ###### Suggested change ∙ *Feature Preview* Convert all columns at once using pandas' built-in functionality: ```python if types: try: df = df.astype(types) except ValueError as ex: # Handle error and identify problematic column ``` ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:387d2699-b8f6-4b2d-8b08-906faef26e7e --> [](387d2699-b8f6-4b2d-8b08-906faef26e7e) ########## superset/commands/database/uploaders/csv_reader.py: ########## @@ -188,13 +189,32 @@ index_col = kwargs.get("index_col") if isinstance(index_col, str): result.index.name = index_col - return result - return pd.DataFrame() + df = result + else: + df = pd.read_csv( + filepath_or_buffer=file.stream, + **kwargs, + ) - return pd.read_csv( - filepath_or_buffer=file.stream, - **kwargs, - ) + if types: + for column, dtype in types.items(): + try: + df[column] = df[column].astype(dtype) + except ValueError as ex: + error_msg = f"Non {dtype} value found in column '{column}'." + ex_msg = str(ex) + invalid_value = ex_msg.split(":")[-1].strip().strip("'") + error_msg += ( + f" Value: '{invalid_value}'" if invalid_value else "" + ) + mask = df[column] == invalid_value + if mask.any() and invalid_value: + line_number = mask.idxmax() + kwargs.get("header", 0) + 1 Review Comment: ### Inefficient Invalid Value Detection <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? The code performs a full column scan to find the first occurrence of an invalid value. ###### Why this matters This creates an unnecessary intermediate boolean mask array and scans the entire column even after finding the first invalid value. ###### Suggested change ∙ *Feature Preview* Use more efficient first occurrence detection: ```python try: invalid_idx = df.index[df[column] == invalid_value][0] line_number = invalid_idx + kwargs.get("header", 0) + 1 except IndexError: line_number = None ``` ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:1bf1dc73-1eab-492d-9d87-9df32c635bca --> [](1bf1dc73-1eab-492d-9d87-9df32c635bca) ########## superset/commands/database/uploaders/csv_reader.py: ########## @@ -188,13 +189,32 @@ def _read_csv( # noqa: C901 index_col = kwargs.get("index_col") if isinstance(index_col, str): result.index.name = index_col - return result - return pd.DataFrame() + df = result + else: + df = pd.read_csv( + filepath_or_buffer=file.stream, + **kwargs, + ) - return pd.read_csv( - filepath_or_buffer=file.stream, - **kwargs, - ) + if types: + for column, dtype in types.items(): + try: + df[column] = df[column].astype(dtype) + except ValueError as ex: + error_msg = f"Non {dtype} value found in column '{column}'." + ex_msg = str(ex) + invalid_value = ex_msg.split(":")[-1].strip().strip("'") + error_msg += ( + f" Value: '{invalid_value}'" if invalid_value else "" + ) + mask = df[column] == invalid_value Review Comment: ### Unreliable Invalid Value Detection <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? The code assumes the invalid value extracted from the exception message matches exactly with values in the DataFrame, which may not always be true due to type differences or string representation variations. ###### Why this matters This can lead to incorrect line number reporting or missing error detection when the invalid value's string representation differs from how it appears in the DataFrame. ###### Suggested change ∙ *Feature Preview* Use pandas' type conversion error handling to identify problematic values more reliably: ```python try: pd.to_numeric(df[column]) if dtype in ('int64', 'float64') else df[column].astype(dtype) except ValueError as ex: error_msg = f"Non {dtype} value found in column '{column}'." mask = pd.to_numeric(df[column], errors='coerce').isna() if dtype in ('int64', 'float64') else df[column].astype(dtype, errors='ignore') == df[column] if mask.any(): line_number = mask.idxmax() + kwargs.get('header', 0) + 1 invalid_value = df.loc[mask.idxmax(), column] error_msg += f" Value: '{invalid_value}', line: {line_number}." raise DatabaseUploadFailed(message=error_msg) from ex ``` ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:e455147b-bf2d-4c49-a656-37dbf69d2f17 --> [](e455147b-bf2d-4c49-a656-37dbf69d2f17) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@superset.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: notifications-unsubscr...@superset.apache.org For additional commands, e-mail: notifications-h...@superset.apache.org