Hi,

Two points:

(1)

> while session_data_set.has_next():
>     do nothing

You need to call next() to iterate the whole dataset:

while session_data_set.has_next():
    session_data_set.next()

(2)

> C. As you select 30 columns in one SQL, I think maybe we can change some
> IoTDB configuration to improve the performance. (let's leave it in the
> future.)

You could try to use the tracing tool [1] to get more information. Besides,
to tune the performance of IoTDB, we need to know more:

- JVM memory allocation: MAX_HEAP_SIZE in iotdb-env.sh
- Number of storage groups: count storage group
- Total number of time series: count timeseries

[1] http://iotdb.apache.org/UserGuide/Master/System%20Tools/Performance%20Tracing%20Tool.html

Thanks,
--
Jialin Qiao
School of Software, Tsinghua University

乔嘉林
清华大学 软件学院

> -----Original Message-----
> From: "Xiangdong Huang" <[email protected]>
> Sent: 2021-02-03 09:50:59 (Wednesday)
> To: dev <[email protected]>
> Cc:
> Subject: Re: New here - Python client to pandas
>
> Hi,
>
> Ok, I am interested in two things:
> 1. is it fast enough to get such data in 1.67s?
> 2. how to improve the performance of transforming the dataset to a DataFrame.
>
> Let's discuss them one by one (maybe you are more interested in #2).
>
> #1.
>
> A. As your fetch_size is 1000, and there are 40778 rows in total, 1.67s is
> not the final time cost.
> To check the real time cost (i.e., until the client has got all data
> records), you should measure the following:
>
> // time begin
> session_data_set = session.execute_query_statement("select
> s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s22,s23,s24,s25,s26,s27,s28,s29,s30,s31,s32,s33,s61,s62,s64,s72,s56,s57
> from root.sae.trends.1 order by time desc")
> while session_data_set.has_next():
>     do nothing
> // time end
>
> B. Could you have a try using Java?
> Steps:
> 1) in IoTDB's source code, find the file
> `example/jdbc/src/main/java/org/apache/iotdb/JDBCExample.java`; in the main
> function, just leave `statement.executeQuery`, and change the SQL to yours.
> 2) modify `outputResult` to print to the screen with `System.out.print`
> 3) run main() and count the time cost.
>
> C. As you select 30 columns in one SQL, I think maybe we can change some
> IoTDB configuration to improve the performance. (Let's leave it for the
> future.)
>
> #2.
> At least I found two potential performance issues:
>
> > x.append(tuple(session_data_set.next().__str__().split()))
>
> __str__() is not efficient in IoTDB's Python implementation
> (client-py/src/iotdb/utils/RowRecord.py). From a Java programmer's view,
> it copies strings again and again.
>
> Instead, why not generate a matrix directly?
>
> Best,
> -----------------------------------
> Xiangdong Huang
> School of Software, Tsinghua University
>
> 黄向东
> 清华大学 软件学院
>
> On Wed, Feb 3, 2021 at 1:23 AM Igor Gois <[email protected]> wrote:
>
> > Hi Steve,
> >
> > Executing only the query took 1.67s:
> >
> > %%time
> > session = Session(ip, port_, username_, password_, fetch_size, zone_id)
> > session.open(False)
> > session_data_set = session.execute_query_statement("select
> > s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s22,s23,s24,s25,s26,s27,s28,s29,s30,s31,s32,s33,s61,s62,s64,s72,s56,s57
> > from root.sae.trends.1 order by time desc")
> > session_data_set.set_fetch_size(1024)
> >
> > output:
> >
> > setting time zone_id as -0300, message: None
> > Wall time: 1.67 s
> >
> > The same query plus the dataset-to-DataFrame conversion took 48s:
> >
> > def sql_to_pandas(sql):
> >     session_data_set = session.execute_query_statement(sql)
> >     results_columns = session_data_set.get_column_names()
> >     x = []
> >
> >     while session_data_set.has_next():
> >         x.append(tuple(session_data_set.next().__str__().split()))
> >
> >     results = pd.DataFrame(x, columns=results_columns)
> >     results['Time'] = results['Time'].astype('int64').apply(lambda x: x / 1000)
> >     results['Time'] = pd.to_datetime(results['Time'], unit="s")
> >
> >     return results
> >
> > # execute sql query statement
> > x = sql_to_pandas("select
> > s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s22,s23,s24,s25,s26,s27,s28,s29,s30,s31,s32,s33,s61,s62,s64,s72,s56,s57
> > from root.sae.trends.1 order by time desc")
> >
> > output:
> >
> > [40778 rows x 30 columns]
> > Wall time: 48.4 s
> >
> > If someone has any idea how to improve this, please let me know.
> >
> > Thank you so much, everyone
> >
> > Att,
> >
> > Igor Gois
> >
> > On Tue, Feb 2, 2021 at 12:57 PM Steve Su <[email protected]> wrote:
> >
> > > Hi Igor,
> > >
> > > > Is there a faster way to do this?
> > >
> > > From my point of view, your current implementation should be fast
> > > enough.
> > >
> > > > When I remove the 'limit 100' it takes too long. The time series has
> > > > 72 sensors with 40k datapoints each.
> > >
> > > For IoTDB, this amount of data is not large. I want to know how much
> > > time was spent executing the query (excluding the time to convert the
> > > dataset into a DataFrame).
> > >
> > > In addition, you can also check the proportion of unsequence data.
> > > When there is a lot of unsequence data, the query may be slow.
> > > Sequence data and unsequence data are under the folders
> > > data/data/sequence and data/data/unsequence respectively.
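To measure the pure query time that is being asked about here (draining the result set without any DataFrame conversion), a minimal sketch could look like the following. The method names (execute_query_statement, has_next, next) are the ones already used in this thread; the helper itself is hypothetical:

```python
import time


def time_query(session, sql):
    """Drain the whole result set without converting anything, so the
    measured time covers only fetching and deserialization."""
    start = time.perf_counter()
    data_set = session.execute_query_statement(sql)
    rows = 0
    while data_set.has_next():
        data_set.next()  # discard the record on purpose
        rows += 1
    return rows, time.perf_counter() - start
```

Calling it with the 30-column SQL above on an open session gives a baseline to compare against the 48s end-to-end figure.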
> > >
> > > Best,
> > > Steve Su
> > >
> > > ------------------ Original Message ------------------
> > > From: "dev" <[email protected]>
> > > Sent: Tuesday, February 2, 2021, 8:09 PM
> > > To: "dev" <[email protected]>
> > > Subject: Re: New here - Python client to pandas
> > >
> > > Thank you, Xiangdong
> > >
> > > Here is the python code that I used:
> > >
> > > # execute sql query statement
> > > session_data_set = session.execute_query_statement("select
> > > s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s22,s23,s24,s25,s26,s27,s28,s29,s30,s31,s32,s33,s61,s62,s64,s72,s56,s57
> > > from root.sae.trends.1 order by time desc limit 100")
> > >
> > > session_data_set.set_fetch_size(1024)
> > > results_columns = session_data_set.get_column_names()
> > > print(session_data_set.get_column_types())
> > > print(results_columns)
> > >
> > > results = pd.DataFrame(columns=results_columns)
> > >
> > > while session_data_set.has_next():
> > >     row = session_data_set.next()
> > >     row_str = row.__str__().split()
> > >     row_str[0] = datetime.fromtimestamp(
> > >         row.get_timestamp() / 1000).strftime("%Y-%m-%d %H:%M:%S")
> > >     results = results.append(
> > >         pd.Series(row_str, index=results_columns), ignore_index=True)
> > >
> > > print(results)
> > > session_data_set.close_operation_handle()
> > >
> > > When I remove the 'limit 100' it takes too long. The time series has 72
> > > sensors with 40k datapoints each.
> > >
> > > Is there a faster way to do this?
> > >
> > > Thank you in advance
> > >
> > > IoTDB> select count(s1) from root.sae.trends.1
> > > +---------------------------+
> > > |count(root.sae.trends.1.s1)|
> > > +---------------------------+
> > > |                      40778|
> > > +---------------------------+
> > >
> > > IoTDB> count timeseries root
> > > +-----+
> > > |count|
> > > +-----+
> > > |   72|
> > > +-----+
> > >
> > > Att,
> > >
> > > Igor Gois
> > >
> > > On Tue, Feb 2, 2021 at 2:00 AM Xiangdong Huang <[email protected]> wrote:
> > >
> > > > By the way, indeed we should consider whether our iotdb-session
> > > > python API is efficient enough (we have many array operations and
> > > > serialization operations in the Python lib).
> > > >
> > > > As I am not an expert in Python, I cannot draw a conclusion; I am
> > > > also calling on contributors here.
> > > >
> > > > Best,
> > > > -----------------------------------
> > > > Xiangdong Huang
> > > > School of Software, Tsinghua University
> > > >
> > > > 黄向东
> > > > 清华大学 软件学院
> > > >
> > > > On Tue, Feb 2, 2021 at 12:56 PM Xiangdong Huang <[email protected]> wrote:
> > > >
> > > > > Hi Igor,
> > > > >
> > > > > > I am sorry if this is not the right place to ask
> > > > >
> > > > > The mailing list is absolutely the right place :D
> > > > > But the problem is that our mailing list only supports plain text.
> > > > > If you have to upload screenshots, please create a JIRA issue
> > > > > (https://issues.apache.org/jira/projects/IOTDB/issues).
> > > > >
> > > > > Actually, several days ago my mate @[email protected] also
> > > > > discussed with me integrating the IoTDB Python client with pandas.
> > > > > I think it is a great idea, and we can develop it together.
> > > > >
> > > > > Best,
> > > > > -----------------------------------
> > > > > Xiangdong Huang
> > > > > School of Software, Tsinghua University
> > > > >
> > > > > 黄向东
> > > > > 清华大学 软件学院
> > > > >
> > > > > On Tue, Feb 2, 2021 at 6:23 AM Igor Gois <[email protected]> wrote:
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> I am sorry if this is not the right place to ask, but I couldn't
> > > > >> install QQ or WeChat. I am from Brazil.
> > > > >>
> > > > >> I am using the python client and I would like to know if there is
> > > > >> a way to transform a session dataset
> > > > >> (session.execute_query_statement) into a pandas dataframe?
> > > > >>
> > > > >> I tried with the example code in the while loop. It worked, but it
> > > > >> was not so fast.
> > > > >>
> > > > >> I am in a project to store and analyze wind turbine data, and
> > > > >> IoTDB seems to be a great fit.
> > > > >>
> > > > >> Thanks in advance, and sorry if this is the wrong place.
> > > > >>
> > > > >> [image: image.png]
> > > > >>
> > > > >> Igor Gois
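Following up on the #2 suggestion above ("why not generate a matrix directly?"): a rough sketch that collects raw field values per record instead of parsing each record's __str__() output. Only get_timestamp() is confirmed by the code in this thread; the get_fields()/get_value() accessors and the "Time" first-column convention are assumptions here, so check the RowRecord and Field classes in your client-py version:

```python
import pandas as pd


def dataset_to_dataframe(data_set):
    """Build a pandas DataFrame from a session data set by collecting
    raw field values row by row, avoiding __str__() string parsing.

    Assumes each record offers get_timestamp() and get_fields(), and
    that each field offers get_value() -- hypothetical names, verify
    against your client-py version.
    """
    columns = data_set.get_column_names()  # first column is "Time"
    rows = []
    while data_set.has_next():
        record = data_set.next()
        rows.append([record.get_timestamp()]
                    + [f.get_value() for f in record.get_fields()])
    df = pd.DataFrame(rows, columns=columns)
    # Timestamps come back in milliseconds, as in the snippets above.
    df["Time"] = pd.to_datetime(df["Time"], unit="ms")
    return df
```

This also avoids the repeated DataFrame.append in the earlier loop, which reallocates the frame on every row; building the full list first and constructing the DataFrame once is much cheaper.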
