Re: big issue on retrieving 400MB data

吴朱华 Tue, 09 May 2017 03:22:16 -0700

We have tested few days, it is 600seconds to fetch around 6GB data when
using big fetch size, 10MB per sec, the performance still has room to
improve for hive jdbc driver or impala front-end.


2017-04-29 14:32 GMT+08:00 yu feng <[email protected]>:

> my pleasure.
>
> 2017-04-29 14:16 GMT+08:00 吴朱华 <[email protected]>:
>
> > thx，yufeng，your solution save the day^_^
> >
> > 2017-04-29 13:50 GMT+08:00 yu feng <[email protected]>:
> >
> > > We has faced this problem before with JDBC, this query speed too much
> > time
> > > on fetching result, and the query is running. which will occupy slot of
> > > request pool(when we configured admission control), we have to cancel
> the
> > > query after a Certain time time such as 5 minutes.
> > >
> > > In our solution, JDBC fetch result with default fetchSize = 50, we try
> to
> > > set fetchSize = 10000（by call setFetchSize()） and the acceleration is
> > > obvious（in our test, result size is 2000+W and this can speed up from
> > more
> > > than 1 hour to 500 seconds.）.
> > >
> > > 2017-04-29 0:39 GMT+08:00 Mostafa Mokhtar <[email protected]>:
> > >
> > > > Hi Jeszy
> > > >
> > > >
> > > > One option for large data extracts is to create a text table in HDFS
> > then
> > > > use "hdfs dfs -copyToLocal
> > > > <https://hadoop.apache.org/docs/current/hadoop-project-
> > > dist/hadoop-common/
> > > > FileSystemShell.html#copyToLocal>"
> > > > to extract the text file(s), for large data dumps storing the data in
> > > HDFS
> > > > then extracting it is likely to be faster than impala-shell or JDCB.
> > > >
> > > > When using impala-shell it is best practice to use -B to speedup
> > fetching
> > > > the results.
> > > >
> > > > Below are the results for two experiments, storing the results in an
> > HDFS
> > > > text table and writing the out of HDFS file took 14 seconds compared
> to
> > > 40
> > > > seconds when using just impala-shell.
> > > >
> > > > Lineitem table was used, output data size is 390MB which is
> comparable
> > to
> > > > your use case.
> > > >
> > > > *Write to HDFS then extract*
> > > >
> > > > time (impala-shell -q "drop table if exists
> > tpch_300_parquet.lineitem_3m;
> > > > create table tpch_300_parquet.lineitem_3m stored as textfile as
> select
> > *
> > > > from tpch_300_parquet.lineitem limit 3000000;" ; hdfs dfs
> -copyToLocal
> > > > hdfs://
> > > > vd1315.domain.com:8020/user/hive/warehouse/tpch_300_
> > > parquet.db/lineitem_3m
> > > > .; du -sh lineitem_3m )
> > > > Starting Impala Shell without Kerberos authentication
> > > > Connected to server1.domain.com:21000
> > > > Server version: impalad version 2.9.0-SNAPSHOT RELEASE (build
> > > > dc25acd558e9388290503e0e1100baac2b3c800b)
> > > > Query: drop table if exists tpch_300_parquet.lineitem_3m
> > > > Query: create table tpch_300_parquet.lineitem_3m stored as textfile
> as
> > > > select * from tpch_300_parquet.lineitem limit 3000000
> > > > Query submitted at: 2017-04-28 09:26:23 (Coordinator:
> > > > http://server1.domain.com:25000)
> > > > Query progress can be monitored at:
> > > > http://server1.domain.com:25000/query_plan?query_id=a24f86518c3de1ce
> :
> > > > 2f35ef9b00000000
> > > > +-------------------------+
> > > > | summary                 |
> > > > +-------------------------+
> > > > | Inserted 3000000 row(s) |
> > > > +-------------------------+
> > > > Fetched 1 row(s) in 11.23s
> > > > 390M lineitem_3m
> > > >
> > > > *real 0m14.392s*
> > > > user 0m3.503s
> > > > sys 0m0.655s
> > > >
> > > >
> > > > *Extract results using impala-shell*
> > > >
> > > > [mmokhtar@server1 ~]$ time (impala-shell -B -q "select * From
> > > > tpch_300_parquet.lineitem_3m " > lineitem_3m_impala_shell.txt; du -sh
> > > > lineitem_3m_impala_shell.txt; )
> > > > Starting Impala Shell without Kerberos authentication
> > > > Connected to server1.domain.com:21000
> > > > Server version: impalad version 2.9.0-SNAPSHOT RELEASE (build
> > > > dc25acd558e9388290503e0e1100baac2b3c800b)
> > > > Query: select * From tpch_300_parquet.lineitem_3m
> > > > Query submitted at: 2017-04-28 09:27:39 (Coordinator:
> > > > http://server1.domain.com:25000)
> > > > Query progress can be monitored at:
> > > > http://server1.domain.com:25000/query_plan?query_id=ee421b6d4a2226d3
> :
> > > > 8acbb75f00000000
> > > > Fetched 3000000 row(s) in 38.88s
> > > > 390M lineitem_3m_impala_shell.txt
> > > >
> > > > *real 0m39.152s*
> > > > user 0m26.012s
> > > > sys 0m0.668s
> > > >
> > > >
> > > > On Fri, Apr 28, 2017 at 7:26 AM, Jeszy <[email protected]> wrote:
> > > >
> > > > > Hey,
> > > > >
> > > > > It looks like all the time is spent waiting for the client to fetch
> > the
> > > > > results:
> > > > >      - ClientFetchWaitTimer: 17m31s
> > > > >
> > > > > Try doing:
> > > > > impala-shell -B -q '<query>'
> > > > >
> > > > > HTH
> > > > >
> > > > > 2017-04-28 14:51 GMT+02:00 吴朱华 <[email protected]>:
> > > > > > Maybe I just paste some main thing on mail , and congratulation
> on
> > > IPO
> > > > > > thing.
> > > > > >
> > > > > > Unregister query: 17m42s (17m42s)
> > > > > >
> > > > > > Fetched 317246 row(s) in 1062.84s
> > > > > > Query Runtime Profile:
> > > > > > Query (id=8149e2439f43b15a:f08e570d7fbf1085):
> > > > > >   Summary:
> > > > > >     Session ID: 35436d1112b79287:9045c79c795858a5
> > > > > >     Session Type: BEESWAX
> > > > > >     Start Time: 2017-04-28 11:50:00.292615000
> > > > > >     End Time: 2017-04-28 12:07:43.133484000
> > > > > >     Query Type: QUERY
> > > > > >     Query State: FINISHED
> > > > > >     Query Status: OK
> > > > > >     Impala Version: impalad version 2.5.0-cdh5-INTERNAL RELEASE
> > > (build
> > > > > > 43880282edc04c03c162bbea6fc85b5388e7fdde)
> > > > > >     User: impala
> > > > > >     Connected User: impala
> > > > > >     Delegated User:
> > > > > >     Network Address: ::ffff:10.44.10.186:36325
> > > > > >     Default Db: sjzy
> > > > > >     Sql Statement: select
> > > > > > MRECID,UNITID,PCQDM,PCQMC,PCXQDM,PCXQMC,DM,H001,H002,
> > > > > H003,H021,H022,H023,H024,H025,H026A,H026B,H026C,H026D,H026E,
> > > > > H026F,H026G,H027,H028,H029,H030,H031,H032,H033,H034,H035,
> > > > > H036,H037A,H037B,H037C,H038,H039,H040,H041,H042,H043A,
> > > > > H043B,H043C,H043D,H043E,H043F,H043G,H043H,H043I,H043J,H043K,
> > > > > H043L,H044A,H044B,H044C,H044D,H044E,H044F,H044G,H044H,H044I,
> > > > > H050,H051,H052,H053,H054,H055,H056,H061,H062,H063,H064,H065,
> > > > > H066,H070,H071,H072,H073,H074,H075,H080,H100,H111,H112,H113,
> > > > > H120,H200,H201,H202,H203,H204,H205,H206,H207,H208,H209,H210,
> > > > > H211,H300,H320,H321,H322,H323,H324,H400,H401,H402,H403,H404,
> > > > > H405,H406,H500,H600,H601,H602,H603,H604,H605,H606,H607,H608,
> > > > > H609,H610,H611,H612,H613,H614,H615,H616,H621A,H621B,H621C,
> > > > > H621D,H621E,H621F,H622A,H622B,H622C,H801,H802,H803,H804,
> > H901,H902,H903
> > > > > > FROM NP_2017_NP601 WHERE DS_AREACODE LIKE '445281%'
> > > > > >     Coordinator: node1.sky.org:22000
> > > > > >     Query Options (non default):
> > > > > >     Plan:
> > > > > > ----------------
> > > > > > Estimated Per-Host Requirements: Memory=4.50GB VCores=1
> > > > > >
> > > > > > 01:EXCHANGE [UNPARTITIONED]
> > > > > > |  hosts=4 per-host-mem=unavailable
> > > > > > |  tuple-ids=0 row-size=1.67KB cardinality=1155911
> > > > > > |
> > > > > > 00:SCAN HDFS [sjzy.np_2017_np601, RANDOM]
> > > > > >    partitions=1/1 files=20 size=1.06GB
> > > > > >    predicates: DS_AREACODE LIKE '445281%'
> > > > > >    table stats: 11559109 rows total
> > > > > >    column stats: all
> > > > > >    hosts=4 per-host-mem=4.50GB
> > > > > >    tuple-ids=0 row-size=1.67KB cardinality=1155911
> > > > > > ----------------
> > > > > >     Estimated Per-Host Mem: 4831838208
> > > > > >     Estimated Per-Host VCores: 1
> > > > > >     Request Pool: default-pool
> > > > > >     ExecSummary:
> > > > > > Operator       #Hosts  Avg Time  Max Time    #Rows  Est. #Rows
> >  Peak
> > > > Mem
> > > > > >  Est. Peak Mem  Detail
> > > > > > ------------------------------------------------------------
> > > > > -------------------------------------------------
> > > > > > 01:EXCHANGE         1  32.314ms  32.314ms  317.25K       1.16M
> > > >   0
> > > > > >      -1.00 B  UNPARTITIONED
> > > > > > 00:SCAN HDFS       20   1s137ms   1s348ms  317.25K       1.16M
> > > 163.85
> > > > MB
> > > > > >      4.50 GB  sjzy.np_2017_np601
> > > > > >     Planner Timeline: 53.683ms
> > > > > >        - Analysis finished: 24.565ms (24.565ms)
> > > > > >        - Equivalence classes computed: 26.389ms (1.823ms)
> > > > > >        - Single node plan created: 33.607ms (7.218ms)
> > > > > >        - Runtime filters computed: 33.684ms (76.568us)
> > > > > >        - Distributed plan created: 39.125ms (5.441ms)
> > > > > >        - Planning finished: 53.683ms (14.558ms)
> > > > > >     Query Timeline: 17m42s
> > > > > >        - Start execution: 43.792us (43.792us)
> > > > > >        - Planning finished: 60.640ms (60.596ms)
> > > > > >        - Ready to start 20 remote fragments: 65.111ms (4.471ms)
> > > > > >        - All 20 remote fragments started: 74.572ms (9.461ms)
> > > > > >        - Rows available: 744.300ms (669.728ms)
> > > > > >        - First row fetched: 790.128ms (45.828ms)
> > > > > >        - Unregister query: 17m42s (17m42s)
> > > > > >   ImpalaServer:
> > > > > >      - ClientFetchWaitTimer: 17m31s
> > > > > >      - RowMaterializationTimer: 10s024ms
> > > > > >
> > > > > > 2017-04-28 19:44 GMT+08:00 Jim Apple <[email protected]>:
> > > > > >
> > > > > >> dev@ does not appear to accept attachments. You can upload it
> > > > somewhere
> > > > > >> and
> > > > > >> post a link, though.
> > > > > >>
> > > > > >> On Thu, Apr 27, 2017 at 11:35 PM, 吴朱华 <[email protected]>
> wrote:
> > > > > >>
> > > > > >> > Oops, I just resend it, you know the chinese network^_^
> > > > > >> >
> > > > > >> > 2017-04-28 14:20 GMT+08:00 Mostafa Mokhtar <
> > [email protected]
> > > >:
> > > > > >> >
> > > > > >> >> Btw the profile wasn't attached.
> > > > > >> >> Please resend.
> > > > > >> >>
> > > > > >> >> On Thu, Apr 27, 2017 at 11:11 PM, 吴朱华 <[email protected]>
> > wrote:
> > > > > >> >>
> > > > > >> >>> Profile is in the attachment, thanks
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>> 2017-04-28 13:10 GMT+08:00 Dimitris Tsirogiannis <
> > > > > >> >>> [email protected]>:
> > > > > >> >>>
> > > > > >> >>>> Maybe you also want to post some information about the
> schema
> > > > (how
> > > > > >> wide
> > > > > >> >>>> your table is, does it use nested types, etc) as well as
> the
> > > > > profile
> > > > > >> of
> > > > > >> >>>> the
> > > > > >> >>>> slow query.
> > > > > >> >>>>
> > > > > >> >>>> Dimitris
> > > > > >> >>>>
> > > > > >> >>>> On Thu, Apr 27, 2017 at 9:30 PM, 吴朱华 <[email protected]>
> > > wrote:
> > > > > >> >>>>
> > > > > >> >>>> > Hi guys:
> > > > > >> >>>> > we can facing a big issue when select * from a big table.
> > > > > >> >>>> > The performance is 17minutes for retrieving 400MB data.
> > Even
> > > > slow
> > > > > >> >>>> under
> > > > > >> >>>> > JDBC situation.
> > > > > >> >>>> > Is there anyway to improve it?^_^
> > > > > >> >>>> >
> > > > > >> >>>>
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
>

Re: big issue on retrieving 400MB data

Reply via email to