Hi,

Ok, I am interested in two things:
1. Is fetching this data in 1.67 s fast enough?
2. How to improve the performance of transforming the dataset into a DataFrame.

Let's discuss them one by one (maybe you are more interested in #2).
#1.

A. As your fetch_size is 1000 and there are 40778 rows in total, 1.67 s is
not the final time cost.
To measure the real time cost (i.e., until the client has received all data
records), you should time the following:

import time

start = time.time()
session_data_set = session.execute_query_statement(
    "select s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s22,s23,s24,s25,"
    "s26,s27,s28,s29,s30,s31,s32,s33,s61,s62,s64,s72,s56,s57 "
    "from root.sae.trends.1 order by time desc")
while session_data_set.has_next():
    session_data_set.next()  # consume every record; do nothing with it
print("fetch-all time: %.2f s" % (time.time() - start))

B. Could you try the same in Java?
Steps:
1) In IoTDB's source code, find the file
`example/jdbc/src/main/java/org/apache/iotdb/JDBCExample.java`; in the main
function, keep only the `statement.executeQuery` call and change the SQL to
yours.
2) Modify `outputResult` so it prints to the screen with `System.out.print`.
3) Run main() and measure the time cost.

C. As you select 30 columns in one SQL statement, I think we may be able to
tune some IoTDB configuration to improve the performance. (Let's leave that
for the future.)

#2.
I see at least two potential performance issues:

> x.append(tuple(session_data_set.next().__str__().split()))

 __str__() is not efficient in IoTDB's Python implementation
(client-py/src/iotdb/utils/RowRecord.py). From a Java programmer's
perspective, it copies strings again and again.

Instead, why not generate a matrix directly?
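To make the "matrix" idea concrete, here is a rough sketch in plain pandas. The mock rows and column names below are placeholders (real code would pull typed fields from each RowRecord rather than parse its string form): collect rows into one list, build the DataFrame once, and convert the timestamp column in a single vectorized call.

```python
# A rough sketch of the "generate a matrix directly" idea, using mock rows
# in place of real RowRecord objects (no iotdb-session import here, so the
# row layout is a placeholder). The key point: append typed values to a
# plain list and build the DataFrame once, instead of round-tripping every
# row through __str__() and split().
import pandas as pd

# Mock rows: (timestamp in epoch milliseconds, two sensor readings)
mock_rows = [(1612300000000 + i * 1000, float(i), 2.0 * i) for i in range(5)]

matrix = []
for ts, s1, s2 in mock_rows:
    # In the real loop this would read typed fields from the RowRecord
    matrix.append((ts, s1, s2))

df = pd.DataFrame(matrix, columns=["Time", "s1", "s2"])
# One vectorized conversion instead of astype/apply per element
df["Time"] = pd.to_datetime(df["Time"], unit="ms")
print(df.shape)
```

This avoids the per-row string copies and the lossy whitespace split, and replaces the element-wise astype/apply timestamp handling with one vectorized to_datetime.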

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院


Igor Gois <[email protected]> 于2021年2月3日周三 上午1:23写道:

> Hi Steve,
>
> Executing only the query took 1.67 s:
>
> %%time
> session = Session(ip, port_, username_, password_, fetch_size, zone_id)
> session.open(False)
> session_data_set = session.execute_query_statement("select
> s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s22,s23,s24,s25,s26,s27,s28,s29,s30,s31,s32,s33,s61,s62,s64,s72,s56,s57
> from root.sae.trends.1 order by time desc")
> session_data_set.set_fetch_size(1024)
>
> output:
>
> setting time zone_id as -0300, message: None
> Wall time: 1.67 s
>
>
> The same query + dataset to dataframe process took 48s:
>
> def sql_to_pandas(sql):
>
>     session_data_set = session.execute_query_statement(sql)
>     results_columns = session_data_set.get_column_names()
>     x = []
>
>     while session_data_set.has_next():
>         x.append(tuple(session_data_set.next().__str__().split()))
>
>     results = pd.DataFrame(x,columns=results_columns)
>     results['Time'] = results['Time'].astype('int64').apply(lambda x: x/1000)
>     results['Time'] = pd.to_datetime(results['Time'], unit="s")
>
>     return results
>
> # execute sql query statement
> x = sql_to_pandas("select
>
> s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s22,s23,s24,s25,s26,s27,s28,s29,s30,s31,s32,s33,s61,s62,s64,s72,s56,s57
> from root.sae.trends.1 order by time desc")
>
> output:
>
> [40778 rows x 30 columns]
> Wall time: 48.4 s
>
>
> If someone has any idea on how to improve this, please let me know
>
> Thank you so much everyone
>
> Att,
>
> Igor Gois
>
>
>
>
>
>
> On Tue, Feb 2, 2021 at 12:57 PM, Steve Su <[email protected]> wrote:
>
> > Hi Igor,
> >
> > > Is there a faster way to do this?
> >
> > From my point of view, your current implementation should be fast enough.
> >
> > > When I remove the 'limit 100' it takes too long. The time series has 72
> > sensors with 40k datapoints each.
> >
> > For IoTDB, this amount of data is not large. I want to know how much time
> > was spent in executing the query (excluding the time to convert the
> > dataset into a DataFrame).
> >
> > In addition, you can also check the proportion of unsequence data. When
> > there is a lot of unsequence data, the query may be slow. Sequence data
> > and unsequence data are under the folders data/data/sequence and
> > data/data/unsequence respectively.
> >
> > Best,
> > Steve Su
> >
> > ------------------ Original Message ------------------
> > From: "dev" <[email protected]>;
> > Sent: Tuesday, Feb 2, 2021, 8:09 PM
> > To: "dev" <[email protected]>;
> > Subject: Re: New here - Python client to pandas
> >
> > Thank you, Xiangdong
> >
> > Here is the python code that I used:
> >
> > # execute sql query statement
> > session_data_set = session.execute_query_statement("select
> > s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s22,s23,s24,s25,s26,s27,s28,s29,s30,s31,s32,s33,s61,s62,s64,s72,s56,s57
> > from root.sae.trends.1 order by time desc limit 100")
> >
> > session_data_set.set_fetch_size(1024)
> > results_columns = session_data_set.get_column_names()
> > print(session_data_set.get_column_types())
> > print(results_columns)
> >
> > results = pd.DataFrame(columns=results_columns)
> >
> > while session_data_set.has_next():
> >     row = session_data_set.next()
> >     row_str = row.__str__().split()
> >     row_str[0] = datetime.fromtimestamp(row.get_timestamp()/1000).strftime("%Y-%m-%d %H:%M:%S")
> >
> >     results = results.append(pd.Series(row_str, index=results_columns), ignore_index=True)
> >
> >
> > print(results)
> > session_data_set.close_operation_handle()
> >
> > When I remove the 'limit 100' it takes too long. The time series has 72
> > sensors with 40k datapoints each.
> >
> > Is there a faster way to do this?
> >
> > Thank you in advance
> >
> >
> > IoTDB> select count(s1) from root.sae.trends.1
> > +---------------------------+
> > |count(root.sae.trends.1.s1)|
> > +---------------------------+
> > |                      40778|
> > +---------------------------+
> >
> > IoTDB> count timeseries root
> > +-----+
> > |count|
> > +-----+
> > |   72|
> > +-----+
> >
> > Att,
> >
> > Igor Gois
> >
> >
> >
> >
> >
> >
> > On Tue, Feb 2, 2021 at 2:00 AM, Xiangdong Huang <[email protected]>
> > wrote:
> >
> > > By the way, indeed we should consider whether our iotdb-session Python
> > > API is efficient enough (we have many array operations and serialization
> > > operations in the Python lib).
> > >
> > > As I am not an expert in Python, I cannot draw a conclusion. Calling on
> > > contributors to help.
> > >
> > > Best,
> > > -----------------------------------
> > > Xiangdong Huang
> > > School of Software, Tsinghua University
> > >
> > >  黄向东
> > > 清华大学 软件学院
> > >
> > >
> > > On Tue, Feb 2, 2021 at 12:56 PM, Xiangdong Huang <[email protected]> wrote:
> > >
> > > > Hi Igor,
> > > >
> > > > > I am sorry if this is not the right place to ask
> > > >
> > > > The mailing list is absolutely the right place:D
> > > > But the problem is our mailing list only supports plain texts.
> > > > If you have to upload screenshots, please create a jira (
> > > > https://issues.apache.org/jira/projects/IOTDB/issues).
> > > >
> > > > Actually, several days ago, my mate @[email protected]
> > > > <[email protected]> also discussed with me integrating the IoTDB
> > > > Python client with pandas.
> > > > I think it is a great idea, and we can develop it together.
> > > >
> > > > Best,
> > > > -----------------------------------
> > > > Xiangdong Huang
> > > > School of Software, Tsinghua University
> > > >
> > > >  黄向东
> > > > 清华大学 软件学院
> > > >
> > > >
> > > > On Tue, Feb 2, 2021 at 6:23 AM, Igor Gois <[email protected]> wrote:
> > > >
> > > >> Hi everyone,
> > > >>
> > > >> I am sorry if this is not the right place to ask, but I couldn't
> > > >> install QQ or WeChat. I am from Brazil.
> > > >>
> > > >> I am using the Python client and I would like to know: is there a
> > > >> way to transform a session dataset (session.execute_query_statement)
> > > >> into a pandas DataFrame?
> > > >>
> > > >> I tried the example code with the while loop. It worked, but it was
> > > >> not so fast.
> > > >>
> > > >> I am on a project to store and analyze wind turbine data, and IoTDB
> > > >> seems to be a great fit.
> > > >>
> > > >> Thanks in advance and sorry if this is the wrong place
> > > >>
> > > >> [image: image.png]
> > > >>
> > > >>
> > > >> Igor Gois
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > >
>
