Hi,

Two points:

(1)

> while session_data_set.has_next():
>     do nothing

You need to call next() to iterate the whole dataset:

while session_data_set.has_next():
    session_data_set.next()

(2)

> C. As you select 30 columns in one SQL, I think maybe we can change some
> IoTDB configuration to improve the performance. (let's leave it in the
> future.)

You could try to use the tracing tool [1] to get more information. Besides,
to tune the performance of IoTDB, we need to know more:

- JVM memory allocation: MAX_HEAP_SIZE in iotdb-env.sh
- Number of storage groups: count storage group
- Total number of time series: count timeseries

[1] http://iotdb.apache.org/UserGuide/Master/System%20Tools/Performance%20Tracing%20Tool.html

Thanks,
--
Jialin Qiao
School of Software, Tsinghua University

乔嘉林
清华大学 软件学院

> -----Original Message-----
> From: "Xiangdong Huang" <[email protected]>
> Sent: 2021-02-03 09:50:59 (Wednesday)
> To: dev <[email protected]>
> Cc:
> Subject: Re: New here - Python client to pandas
>
> Hi,
>
> Ok, I am interested in two things:
> 1. is it fast enough to get such data in 1.67s?
> 2. how to improve the performance of transforming the dataset to a DataFrame.
>
> Let's discuss them one by one (maybe you are more interested in #2).
>
> #1.
>
> A. As your fetch_size is 1000, and there are 40778 rows in total, 1.67s is
> not the final time cost.
> To check the real time cost (i.e., until the client has got all data
> records), you should measure the following:
>
> // time begin
> session_data_set = session.execute_query_statement("select
> s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s22,s23,s24,s25,s26,s27,s28,s29,s30,s31,s32,s33,s61,s62,s64,s72,s56,s57
> from root.sae.trends.1 order by time desc")
> while session_data_set.has_next():
>     do nothing
> // time end
>
> B. Could you have a try using Java?
> Steps:
> 1) in IoTDB's source code, find the file
> `example/jdbc/src/main/java/org/apache/iotdb/JDBCExample.java`; in the main
> function, just leave `statement.executeQuery`, and change the SQL to yours.
> 2) modify `outputResult` to print to the screen with `System.out.print`
> 3) run main() and count the time cost.
>
> C. As you select 30 columns in one SQL, I think maybe we can change some
> IoTDB configuration to improve the performance. (Let's leave it for the
> future.)
>
> #2.
> At least I found two potential performance issues:
>
> > x.append(tuple(session_data_set.next().__str__().split()))
>
> __str__() is not efficient in IoTDB's Python implementation
> (client-py/src/iotdb/utils/RowRecord.py). From a Java programmer's view,
> it copies strings again and again.
>
> Instead, why not generate a matrix directly?
>
> Best,
> -----------------------------------
> Xiangdong Huang
> School of Software, Tsinghua University
>
> 黄向东
> 清华大学 软件学院
>
> On Wed, Feb 3, 2021 at 1:23 AM Igor Gois <[email protected]> wrote:
>
> > Hi Steve,
> >
> > Executing only the query took 1.67s:
> >
> > %%time
> > session = Session(ip, port_, username_, password_, fetch_size, zone_id)
> > session.open(False)
> > session_data_set = session.execute_query_statement("select
> > s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s22,s23,s24,s25,s26,s27,s28,s29,s30,s31,s32,s33,s61,s62,s64,s72,s56,s57
> > from root.sae.trends.1 order by time desc")
> > session_data_set.set_fetch_size(1024)
> >
> > output:
> >
> > setting time zone_id as -0300, message: None
> > Wall time: 1.67 s
> >
> > The same query plus the dataset-to-DataFrame conversion took 48s:
> >
> > def sql_to_pandas(sql):
> >     session_data_set = session.execute_query_statement(sql)
> >     results_columns = session_data_set.get_column_names()
> >     x = []
> >
> >     while session_data_set.has_next():
> >         x.append(tuple(session_data_set.next().__str__().split()))
> >
> >     results = pd.DataFrame(x, columns=results_columns)
> >     results['Time'] = results['Time'].astype('int64').apply(lambda x: x / 1000)
> >     results['Time'] = pd.to_datetime(results['Time'], unit="s")
> >
> >     return results
> >
> > # execute sql query statement
> > x = sql_to_pandas("select
> > s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s22,s23,s24,s25,s26,s27,s28,s29,s30,s31,s32,s33,s61,s62,s64,s72,s56,s57
> > from root.sae.trends.1 order by time desc")
> >
> > output:
> >
> > [40778 rows x 30 columns]
> > Wall time: 48.4 s
> >
> > If someone has any idea how to improve this, please let me know.
> >
> > Thank you so much, everyone
> >
> > Att,
> >
> > Igor Gois
> >
> > On Tue, Feb 2, 2021 at 12:57 PM Steve Su <[email protected]> wrote:
> >
> > > Hi Igor,
> > >
> > > > Is there a faster way to do this?
> > >
> > > From my point of view, your current implementation should be fast
> > > enough.
> > >
> > > > When I remove the 'limit 100' it takes too long. The time series has
> > > > 72 sensors with 40k datapoints each.
> > >
> > > For IoTDB, this amount of data is not large. I want to know how much
> > > time was spent executing the query (excluding the time to convert the
> > > dataset into a DataFrame).
> > >
> > > In addition, you can also check the proportion of unsequence data.
> > > When there is a lot of unsequence data, the query may be slow.
> > > Sequence data and unsequence data are under the folders
> > > data/data/sequence and data/data/unsequence respectively.
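To measure the pure query time that is being asked about here (draining the result set without any DataFrame conversion), a minimal sketch could look like the following. The method names (execute_query_statement, has_next, next) are the ones already used in this thread; the helper itself is hypothetical:

```python
import time


def time_query(session, sql):
    """Drain the whole result set without converting anything, so the
    measured time covers only fetching and deserialization."""
    start = time.perf_counter()
    data_set = session.execute_query_statement(sql)
    rows = 0
    while data_set.has_next():
        data_set.next()  # discard the record on purpose
        rows += 1
    return rows, time.perf_counter() - start
```

Calling it with the 30-column SQL above on an open session gives a baseline to compare against the 48s end-to-end figure.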
> > >
> > > Best,
> > > Steve Su
> > >
> > > ------------------ Original Message ------------------
> > > From: "dev" <[email protected]>
> > > Sent: Tuesday, February 2, 2021, 8:09 PM
> > > To: "dev" <[email protected]>
> > > Subject: Re: New here - Python client to pandas
> > >
> > > Thank you, Xiangdong
> > >
> > > Here is the python code that I used:
> > >
> > > # execute sql query statement
> > > session_data_set = session.execute_query_statement("select
> > > s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s22,s23,s24,s25,s26,s27,s28,s29,s30,s31,s32,s33,s61,s62,s64,s72,s56,s57
> > > from root.sae.trends.1 order by time desc limit 100")
> > >
> > > session_data_set.set_fetch_size(1024)
> > > results_columns = session_data_set.get_column_names()
> > > print(session_data_set.get_column_types())
> > > print(results_columns)
> > >
> > > results = pd.DataFrame(columns=results_columns)
> > >
> > > while session_data_set.has_next():
> > >     row = session_data_set.next()
> > >     row_str = row.__str__().split()
> > >     row_str[0] = datetime.fromtimestamp(
> > >         row.get_timestamp() / 1000).strftime("%Y-%m-%d %H:%M:%S")
> > >     results = results.append(
> > >         pd.Series(row_str, index=results_columns), ignore_index=True)
> > >
> > > print(results)
> > > session_data_set.close_operation_handle()
> > >
> > > When I remove the 'limit 100' it takes too long. The time series has 72
> > > sensors with 40k datapoints each.
> > >
> > > Is there a faster way to do this?
> > >
> > > Thank you in advance
> > >
> > > IoTDB> select count(s1) from root.sae.trends.1
> > > +---------------------------+
> > > |count(root.sae.trends.1.s1)|
> > > +---------------------------+
> > > |                      40778|
> > > +---------------------------+
> > >
> > > IoTDB> count timeseries root
> > > +-----+
> > > |count|
> > > +-----+
> > > |   72|
> > > +-----+
> > >
> > > Att,
> > >
> > > Igor Gois
> > >
> > > On Tue, Feb 2, 2021 at 2:00 AM Xiangdong Huang <[email protected]> wrote:
> > >
> > > > By the way, indeed we should consider whether our iotdb-session
> > > > python API is efficient enough (we have many array operations and
> > > > serialization operations in the Python lib).
> > > >
> > > > As I am not an expert in Python, I cannot draw a conclusion; I am
> > > > also calling on contributors here.
> > > >
> > > > Best,
> > > > -----------------------------------
> > > > Xiangdong Huang
> > > > School of Software, Tsinghua University
> > > >
> > > > 黄向东
> > > > 清华大学 软件学院
> > > >
> > > > On Tue, Feb 2, 2021 at 12:56 PM Xiangdong Huang <[email protected]> wrote:
> > > >
> > > > > Hi Igor,
> > > > >
> > > > > > I am sorry if this is not the right place to ask
> > > > >
> > > > > The mailing list is absolutely the right place :D
> > > > > But the problem is that our mailing list only supports plain text.
> > > > > If you have to upload screenshots, please create a JIRA issue
> > > > > (https://issues.apache.org/jira/projects/IOTDB/issues).
> > > > >
> > > > > Actually, several days ago my mate @[email protected] also
> > > > > discussed with me integrating the IoTDB Python client with pandas.
> > > > > I think it is a great idea, and we can develop it together.
> > > > >
> > > > > Best,
> > > > > -----------------------------------
> > > > > Xiangdong Huang
> > > > > School of Software, Tsinghua University
> > > > >
> > > > > 黄向东
> > > > > 清华大学 软件学院
> > > > >
> > > > > On Tue, Feb 2, 2021 at 6:23 AM Igor Gois <[email protected]> wrote:
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> I am sorry if this is not the right place to ask, but I couldn't
> > > > >> install QQ or WeChat. I am from Brazil.
> > > > >>
> > > > >> I am using the python client and I would like to know if there is
> > > > >> a way to transform a session dataset
> > > > >> (session.execute_query_statement) into a pandas dataframe?
> > > > >>
> > > > >> I tried with the example code in the while loop. It worked, but it
> > > > >> was not so fast.
> > > > >>
> > > > >> I am in a project to store and analyze wind turbine data, and
> > > > >> IoTDB seems to be a great fit.
> > > > >>
> > > > >> Thanks in advance, and sorry if this is the wrong place.
> > > > >>
> > > > >> [image: image.png]
> > > > >>
> > > > >> Igor Gois
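Following up on the #2 suggestion above ("why not generate a matrix directly?"): a rough sketch that collects raw field values per record instead of parsing each record's __str__() output. Only get_timestamp() is confirmed by the code in this thread; the get_fields()/get_value() accessors and the "Time" first-column convention are assumptions here, so check the RowRecord and Field classes in your client-py version:

```python
import pandas as pd


def dataset_to_dataframe(data_set):
    """Build a pandas DataFrame from a session data set by collecting
    raw field values row by row, avoiding __str__() string parsing.

    Assumes each record offers get_timestamp() and get_fields(), and
    that each field offers get_value() -- hypothetical names, verify
    against your client-py version.
    """
    columns = data_set.get_column_names()  # first column is "Time"
    rows = []
    while data_set.has_next():
        record = data_set.next()
        rows.append([record.get_timestamp()]
                    + [f.get_value() for f in record.get_fields()])
    df = pd.DataFrame(rows, columns=columns)
    # Timestamps come back in milliseconds, as in the snippets above.
    df["Time"] = pd.to_datetime(df["Time"], unit="ms")
    return df
```

This also avoids the repeated DataFrame.append in the earlier loop, which reallocates the frame on every row; building the full list first and constructing the DataFrame once is much cheaper.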
