[ https://issues.apache.org/jira/browse/DRILL-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014216#comment-17014216 ]

benj edited comment on DRILL-7449 at 1/28/20 11:28 AM:
-------------------------------------------------------

I have done a full check: in reality we haven't used drill-url-tools, because it 
sometimes produces incorrect values on big datasets (perhaps due to a memory 
problem caught inside the UDF?).
EDIT 28/01/2020: I just found a bug in drill-url-tools and proposed a 
correction. It may fix the problem.

After some further tests, the standard Drill *parse_url* works well (no memory 
leak) +if the ORDER BY clause is removed+.
Also note that the memory leak already appears with url_parse (from 
drill-url-tools) when an ORDER BY clause is used.

The only code that does not cause any critical problem for our use is a regexp 
of this type:
{code:sql}
SELECT REGEXP_REPLACE(Activity, '^(?:.*:.*@)?([^:]*)(?::.*)?$', '$1') AS Host
FROM (
  SELECT REGEXP_REPLACE(NULLIF(Url, ''),
           '^(?:(?:[^:/?#]+):)?(?://([^/?#]*))(?:[^?#]*)?(?:.*)?', '$1') AS Activity
  FROM ...
)
{code}
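
For completeness, here is a minimal sketch of how this regexp-based extraction could be wired into the CTAS shown in the issue description below; the table and file names are simply reused from that description, and this exact wiring is an assumption rather than the query we actually ran:
{code:sql}
-- Sketch only: regexp-based host extraction instead of parse_url.
-- Table and file names are reused from the issue description below.
CREATE TABLE dfs.test.`output_pqt` AS
(
  SELECT REGEXP_REPLACE(Activity, '^(?:.*:.*@)?([^:]*)(?::.*)?$', '$1') AS Domain
  FROM (
    SELECT REGEXP_REPLACE(NULLIF(T.Url, ''),
             '^(?:(?:[^:/?#]+):)?(?://([^/?#]*))(?:[^?#]*)?(?:.*)?', '$1') AS Activity
    FROM dfs.test.`file.json` AS T
  ) AS R
  ORDER BY Domain
);
{code}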

I don't know why, but from what we have observed, the ORDER BY clause produces 
a number of errors in different contexts with complex queries, and it is 
sometimes necessary to split the query into 2 distinct queries (one SELECT with 
the computations and one SELECT with the ORDER BY).
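
As an illustration of that split, a sketch based on the parse_url query from the issue description below (the intermediate table name tmp_pqt is hypothetical):
{code:sql}
-- Query 1: the SELECT with the computation only (no ORDER BY).
-- The intermediate table name tmp_pqt is hypothetical.
CREATE TABLE dfs.test.`tmp_pqt` AS
(
  SELECT R.parsed.host AS Domain
  FROM (
    SELECT parse_url(T.Url) AS parsed
    FROM dfs.test.`file.json` AS T
  ) AS R
);

-- Query 2: a separate SELECT that only performs the ORDER BY
-- on the materialized intermediate result.
CREATE TABLE dfs.test.`output_pqt` AS
(
  SELECT Domain
  FROM dfs.test.`tmp_pqt`
  ORDER BY Domain
);
{code}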

Note that with the regexp there is no error, even with the ORDER BY clause.



> memory leak parse_url function
> ------------------------------
>
>                 Key: DRILL-7449
>                 URL: https://issues.apache.org/jira/browse/DRILL-7449
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.16.0
>            Reporter: benj
>            Assignee: Igor Guzenko
>            Priority: Major
>         Attachments: embedded_FullJsonProfile.txt, embedded_sqlline.log.txt, 
> embedded_sqlline_with_enable_debug_logging.log.txt
>
>
> Requests with *parse_url* work well when the number of processed rows is low, 
> but produce a memory leak when the number of rows grows (roughly between 
> 500 000 and 1 million). For certain row counts, the same request sometimes 
> works and sometimes fails with a memory leak.
> Extract from the dataset tested:
> {noformat}
> {"Attributable":true,"Description":"Website has been identified as malicious 
> by 
> Bing","FirstReportedDateTime":"2018-03-12T18:49:38Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:49:38Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"172.217.8.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/beginilah-cara-orang-jepang-berpacaran.html","Version":1.5}
> {"Attributable":true,"Description":"Website has been identified as malicious 
> by 
> Bing","FirstReportedDateTime":"2018-03-12T18:14:51Z","IndicatorExpirationDateTime":"2018-04-11T23:33:13Z","IndicatorProvider":"Bing","IndicatorThreatType":"MaliciousUrl","IsPartnerShareable":true,"IsProductLicensed":true,"LastReportedDateTime":"2018-03-12T18:14:51Z","NetworkDestinationAsn":15169,"NetworkDestinationIPv4":"216.58.192.193","NetworkDestinationPort":80,"Tags":["us"],"ThreatDetectionProduct":"ES","TLPLevel":"Amber","Url":"http://pasuruanbloggers.blogspot.ru/2012/12/cara-membuat-widget-slideshow-postingan.html","Version":1.5}
> {noformat}
> Request tested:
> {code:sql}
> ALTER SESSION SET `store.format`='parquet';
> ALTER SESSION SET `store.parquet.use_new_reader` = true;
> ALTER SESSION SET `store.parquet.compression` = 'snappy';
> ALTER SESSION SET `drill.exec.functions.cast_empty_string_to_null`= true;
> ALTER SESSION SET `store.json.all_text_mode` = true;
> ALTER SESSION SET `exec.enable_union_type` = true;
> ALTER SESSION SET `store.json.all_text_mode` = true;
> CREATE TABLE dfs.test.`output_pqt` AS
> (
> SELECT R.parsed.host AS Domain
> FROM ( 
>   SELECT parse_url(T.Url) AS parsed
>   FROM dfs.test.`file.json` AS T
> ) AS R 
> ORDER BY Domain
> );
> {code}
>  
>  Result when the memory leak occurs:
> {noformat}
> Error: SYSTEM ERROR: IllegalStateException: Memory was leaked by query. 
> Memory leaked: (256)
> Allocator(frag:3:0) 3000000/256/9337280/30000000000 (res/actual/peak/limit)
> Fragment 3:0
> Please, refer to logs for more information.
> [Error Id: 3ffa5b43-0dde-4518-bb5a-ea3aab97f3d4 on servor01:31010]
>   (java.lang.IllegalStateException) Memory was leaked by query. Memory 
> leaked: (256)
> Allocator(frag:3:0) 3000000/256/9337280/30000000000 (res/actual/peak/limit)
>     org.apache.drill.exec.memory.BaseAllocator.close():520
>     org.apache.drill.exec.ops.FragmentContextImpl.suppressingClose():552
>     org.apache.drill.exec.ops.FragmentContextImpl.close():546
>     
> org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources():386
>     org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup():214
>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():329
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1149
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():624
>     java.lang.Thread.run():748 (state=,code=0)
> java.sql.SQLException: SYSTEM ERROR: IllegalStateException: Memory was leaked 
> by query. Memory leaked: (256)
> Allocator(frag:3:0) 3000000/256/9337280/30000000000 (res/actual/peak/limit)
> Fragment 3:0
> Please, refer to logs for more information.
> [Error Id: 3ffa5b43-0dde-4518-bb5a-ea3aab97f3d4 on servor01:31010]
>   (java.lang.IllegalStateException) Memory was leaked by query. Memory 
> leaked: (256)
> Allocator(frag:3:0) 3000000/256/9337280/30000000000 (res/actual/peak/limit)
>     org.apache.drill.exec.memory.BaseAllocator.close():520
>     org.apache.drill.exec.ops.FragmentContextImpl.suppressingClose():552
>     org.apache.drill.exec.ops.FragmentContextImpl.close():546
>     
> org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources():386
>     org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup():214
>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():329
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1149
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():624
>     java.lang.Thread.run():748
>       at 
> org.apache.drill.jdbc.impl.DrillCursor.nextRowInternally(DrillCursor.java:538)
>       at 
> org.apache.drill.jdbc.impl.DrillCursor.loadInitialSchema(DrillCursor.java:610)
>       at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:1278)
>       at 
> org.apache.drill.jdbc.impl.DrillResultSetImpl.execute(DrillResultSetImpl.java:58)
>       at 
> org.apache.calcite.avatica.AvaticaConnection$1.execute(AvaticaConnection.java:667)
>       at 
> org.apache.drill.jdbc.impl.DrillMetaImpl.prepareAndExecute(DrillMetaImpl.java:1107)
>       at 
> org.apache.drill.jdbc.impl.DrillMetaImpl.prepareAndExecute(DrillMetaImpl.java:1118)
>       at 
> org.apache.calcite.avatica.AvaticaConnection.prepareAndExecuteInternal(AvaticaConnection.java:675)
>       at 
> org.apache.drill.jdbc.impl.DrillConnectionImpl.prepareAndExecuteInternal(DrillConnectionImpl.java:200)
>       at 
> org.apache.calcite.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:156)
>       at 
> org.apache.calcite.avatica.AvaticaStatement.execute(AvaticaStatement.java:217)
>       at sqlline.Commands.execute(Commands.java:971)
>       at sqlline.Commands.sql(Commands.java:906)
>       at sqlline.SqlLine.dispatch(SqlLine.java:740)
>       at sqlline.SqlLine.runCommands(SqlLine.java:1691)
>       at sqlline.Commands.run(Commands.java:1513)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:38)
>       at sqlline.SqlLine.dispatch(SqlLine.java:736)
>       at sqlline.SqlLine.initArgs(SqlLine.java:475)
>       at sqlline.SqlLine.begin(SqlLine.java:531)
>       at sqlline.SqlLine.start(SqlLine.java:270)
>       at sqlline.SqlLine.main(SqlLine.java:201)
> Caused by: org.apache.drill.common.exceptions.UserRemoteException: SYSTEM 
> ERROR: IllegalStateException: Memory was leaked by query. Memory leaked: (256)
> Allocator(frag:3:0) 3000000/256/9337280/30000000000 (res/actual/peak/limit)
> Fragment 3:0
> Please, refer to logs for more information.
> [Error Id: 3ffa5b43-0dde-4518-bb5a-ea3aab97f3d4 on servor01:31010]
>   (java.lang.IllegalStateException) Memory was leaked by query. Memory 
> leaked: (256)
> Allocator(frag:3:0) 3000000/256/9337280/30000000000 (res/actual/peak/limit)
>     org.apache.drill.exec.memory.BaseAllocator.close():520
>     org.apache.drill.exec.ops.FragmentContextImpl.suppressingClose():552
>     org.apache.drill.exec.ops.FragmentContextImpl.close():546
>     
> org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources():386
>     org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup():214
>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():329
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1149
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():624
>     java.lang.Thread.run():748
>       at 
> org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:123)
>       at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:422)
>       at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:96)
>       at 
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:273)
>       at 
> org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:243)
>       at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
>       at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
>       at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
>       at 
> io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:312)
>       at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:286)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
>       at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
>       at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
>       at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
>       at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
>       at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>       at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
>       at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>       at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>       at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>       at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalStateException: Memory was leaked by query. 
> Memory leaked: (256)
> Allocator(frag:3:0) 3000000/256/9337280/30000000000 (res/actual/peak/limit)
>       at 
> org.apache.drill.exec.memory.BaseAllocator.close(BaseAllocator.java:520)
>       at 
> org.apache.drill.exec.ops.FragmentContextImpl.suppressingClose(FragmentContextImpl.java:552)
>       at 
> org.apache.drill.exec.ops.FragmentContextImpl.close(FragmentContextImpl.java:546)
>       at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources(FragmentExecutor.java:386)
>       at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:214)
>       at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:329)
>       at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>       at .......(:0
> {noformat}
> Note that the same request, including some other fields but excluding the use 
> of _parse_url_, works perfectly well.
>   
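> For comparison, a minimal sketch of such a variant without _parse_url_ (the 
> exact extra fields used are not specified here, so the columns kept below and 
> the output table name are only assumptions):
> {code:sql}
> -- Sketch only: same CTAS shape, but without parse_url.
> -- The selected columns and the table name output_raw_pqt are assumptions.
> CREATE TABLE dfs.test.`output_raw_pqt` AS
> (
>   SELECT T.Url AS Url, T.IndicatorProvider AS Provider
>   FROM dfs.test.`file.json` AS T
>   ORDER BY Url
> );
> {code}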


