abstractdog commented on PR #5613:
URL: https://github.com/apache/hive/pull/5613#issuecomment-2612783611

   > > > > > What is the use case for that service? Can't I check the query history in HUE or DAS (removed for some storage reason), etc.? Please take a look at #5319, which is being worked on by [rtrivedi12](https://github.com/rtrivedi12). I think it provides some extra details for active queries. cc @nrg4878
   > > > > 
   > > > > 
   > > > > looks like #5319 is completely different: it uses the well-known SHOW PROCESSLIST for live queries (live == recent == present in HS2 memory), whereas the Query History Service is meant to be a scalable historical query service, scalable in the sense that it uses the Iceberg table format
   > > > > HUE/DAS might work from different sources, like the protobuf history, whose data source is also created by a query hook, but this service aims to redesign the way of persisting the data while trying to use the same or similar field names that have already been implemented by Impala
   > > > > the current HiveProtoLoggingHook contains a lot of storage detail (e.g. rolling over files and so on), which makes it look a bit less modern compared to, say, the Iceberg format, with which we gain everything (in terms of performance, for instance) that we have achieved by integrating Iceberg into our product
   > > > 
   > > > 
   > > > is that supposed to do the same thing as Impala profile: 
https://github.com/apache/impala/blob/fdc43466350db4437b3e917d4ff24dac58af63c3/testdata/impala-profiles/impala_profile_log_tpcds_compute_stats_v2_default.expected.txt#L1445?
   > > 
   > > 
   > > if you mean the corresponding Impala table, that's implemented in https://issues.apache.org/jira/browse/IMPALA-12426. I can see different upstream commits for that; this is the closest one: [apache/impala@711a9f2](https://github.com/apache/impala/commit/711a9f2bad84f92dc4af61d49ae115f0dc4239da)
   > > their table is sys.impala_query_log
   > 
   > have you checked how they implemented it? maybe they used some tricks to improve perf / make it a background activity, etc.?
   
   in detail, no; we discussed high-level problems, like what the schema would look like and how the table is supposed to be compacted (it isn't compacted automatically; that should be taken care of by the platform)
   when it comes to performance details, what I tried to achieve is:
   1. transform the query data into a query history record - this is sync, I admit; let me add extra logging:
   ```java
       long start = Time.monotonicNow();
       QueryHistoryRecord record = createRecord(driverContext);
       LOG.debug("Created history record (in {}ms): {}", Time.monotonicNow() - start, record);
   ```
   2. the rest of the work is done async
   3. set maxBatchSize to 100 by default and defined a memory limit too, so every 100 query records should be written in one batch... I felt this was a good tradeoff between writing too many small files and writing too rarely
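   to make the batching idea concrete, here is a minimal self-contained sketch (not the actual PR code; the class and field names are hypothetical) of the policy in points 2-3: records are enqueued cheaply on the sync path, and a background drain flushes a batch once either the record count reaches maxBatchSize or an accumulated-bytes limit is hit:

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.BlockingQueue;
   import java.util.concurrent.LinkedBlockingQueue;

   // Hypothetical illustration of the batching policy, not the Hive implementation.
   public class QueryHistoryBatcher {
       private final int maxBatchSize;    // e.g. 100 records per Iceberg write
       private final long maxBatchBytes;  // memory limit for one batch
       private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

       public QueryHistoryBatcher(int maxBatchSize, long maxBatchBytes) {
           this.maxBatchSize = maxBatchSize;
           this.maxBatchBytes = maxBatchBytes;
       }

       // sync path: just hand the record over, no I/O here
       public void enqueue(String record) {
           queue.offer(record);
       }

       // async path: drain at most one batch worth of records;
       // the caller would write the returned batch as a single file
       public List<String> drainBatch() {
           List<String> batch = new ArrayList<>();
           long bytes = 0;
           String record;
           while (batch.size() < maxBatchSize
                   && bytes < maxBatchBytes
                   && (record = queue.poll()) != null) {
               batch.add(record);
               bytes += record.length();
           }
           return batch;
       }

       public static void main(String[] args) {
           QueryHistoryBatcher batcher = new QueryHistoryBatcher(100, 1 << 20);
           for (int i = 0; i < 250; i++) {
               batcher.enqueue("query-record-" + i);
           }
           // 250 queued records drain as batches of 100, 100 and 50
           System.out.println(batcher.drainBatch().size()); // 100
           System.out.println(batcher.drainBatch().size()); // 100
           System.out.println(batcher.drainBatch().size()); // 50
       }
   }
   ```

   with a policy like this, a burst of queries produces one file per ~100 records instead of one file per query, which is exactly the small-files vs. write-latency tradeoff mentioned above.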
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

