[jira] [Created] (FLINK-15538) Separate decimal implementations into separate sub-classes
Liya Fan created FLINK-15538: Summary: Separate decimal implementations into separate sub-classes Key: FLINK-15538 URL: https://issues.apache.org/jira/browse/FLINK-15538 Project: Flink Issue Type: Improvement Components: Table SQL / Runtime Reporter: Liya Fan The current Decimal class has two (somewhat independent) implementations: one is based on Long, while the other is based on BigDecimal. This makes the Decimal class unclear (both implementations are cluttered in a single class) and less efficient (each method involves an if-else branch). So in this issue, we make Decimal an abstract class, and separate the two implementations into two sub-classes. This makes the code clearer and more efficient. -- This message was sent by Atlassian Jira (v8.3.4#803005)
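A minimal sketch of the proposed split (class and method names here are illustrative, not Flink's actual API): an abstract base class, a compact long-backed subclass, and a BigDecimal-backed subclass, with the representation chosen once in a factory instead of an if-else branch in every method.

```java
import java.math.BigDecimal;

public class DecimalSketch {

    /** Abstract base: each method is implemented once per representation, with no branching. */
    public abstract static class Dec {
        final int precision;
        final int scale;

        Dec(int precision, int scale) { this.precision = precision; this.scale = scale; }

        public abstract BigDecimal toBigDecimal();
    }

    /** Compact representation: the unscaled value fits in a long (assumed precision <= 18). */
    static final class CompactDec extends Dec {
        private final long unscaled;

        CompactDec(long unscaled, int precision, int scale) {
            super(precision, scale);
            this.unscaled = unscaled;
        }

        @Override public BigDecimal toBigDecimal() { return BigDecimal.valueOf(unscaled, scale); }
    }

    /** General representation backed by BigDecimal. */
    static final class BigDec extends Dec {
        private final BigDecimal value;

        BigDec(BigDecimal value, int precision, int scale) {
            super(precision, scale);
            this.value = value;
        }

        @Override public BigDecimal toBigDecimal() { return value; }
    }

    /** Factory that picks the subclass once, at construction time. */
    public static Dec of(BigDecimal value, int precision, int scale) {
        if (precision <= 18) {
            return new CompactDec(value.movePointRight(scale).longValueExact(), precision, scale);
        }
        return new BigDec(value, precision, scale);
    }
}
```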
[jira] [Created] (FLINK-14731) LogicalWatermarkAssigner should use specified trait set when doing copy
Liya Fan created FLINK-14731: Summary: LogicalWatermarkAssigner should use specified trait set when doing copy Key: FLINK-14731 URL: https://issues.apache.org/jira/browse/FLINK-14731 Project: Flink Issue Type: Bug Components: Table SQL / Planner Reporter: Liya Fan In the LogicalWatermarkAssigner#copy method, the new LogicalWatermarkAssigner object should be created with the trait set from the input parameter, instead of the trait set of the current object. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-13200) Improve the generated code for if statements
Liya Fan created FLINK-13200: Summary: Improve the generated code for if statements Key: FLINK-13200 URL: https://issues.apache.org/jira/browse/FLINK-13200 Project: Flink Issue Type: Improvement Components: Table SQL / Planner Reporter: Liya Fan Assignee: Liya Fan In the generated code, we often see code snippets like this: if (true) { acc$6.setNullAt(1); } else { acc$6.setField(1, ((int) -1)); } Such code hurts readability and increases the code size, making it more costly to compile and to transfer over the network. In this issue, we remove such useless if conditions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
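The idea can be sketched as constant-condition folding at generation time (a hypothetical helper, not Flink's actual code generator): when the condition string is a compile-time constant, emit only the branch that will be taken.

```java
public class IfFolding {

    /** Emits an if-statement, folding away conditions that are constant "true" or "false". */
    public static String genIf(String condition, String thenCode, String elseCode) {
        if ("true".equals(condition)) {
            return thenCode; // the else-branch is dead code, do not emit it
        }
        if ("false".equals(condition)) {
            return elseCode; // the then-branch is dead code, do not emit it
        }
        return "if (" + condition + ") { " + thenCode + " } else { " + elseCode + " }";
    }
}
```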
[jira] [Created] (FLINK-13108) Remove duplicated type cast in generated code
Liya Fan created FLINK-13108: Summary: Remove duplicated type cast in generated code Key: FLINK-13108 URL: https://issues.apache.org/jira/browse/FLINK-13108 Project: Flink Issue Type: Bug Components: Table SQL / Planner Reporter: Liya Fan Assignee: Liya Fan There are duplicated cast operations in the generated code. For example, when running org.apache.flink.table.runtime.batch.sql.join.JoinITCase#testJoin, the generated code looks like this: @Override public void processElement(org.apache.flink.streaming.runtime.streamrecord.StreamRecord element) throws Exception { org.apache.flink.table.dataformat.BaseRow in1 = (org.apache.flink.table.dataformat.BaseRow) (org.apache.flink.table.dataformat.BaseRow) converter$0.toInternal((org.apache.flink.types.Row) element.getValue()); This issue removes the duplicated type cast. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-13058) Avoid memory copy for the trimming operations of BinaryString
Liya Fan created FLINK-13058: Summary: Avoid memory copy for the trimming operations of BinaryString Key: FLINK-13058 URL: https://issues.apache.org/jira/browse/FLINK-13058 Project: Flink Issue Type: Improvement Components: Table SQL / Runtime Reporter: Liya Fan Assignee: Liya Fan For the trimming operations of BinaryString (trim, trimLeft, trimRight), if the trimmed string is identical to the original string, the memory copy can be avoided by directly returning the original string. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
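The optimization can be illustrated on a simple byte[]-backed string (BinaryString's actual implementation works on memory segments and differs in detail): when trimming removes nothing, return `this` instead of allocating and copying.

```java
public class TrimSketch {
    private final byte[] bytes;

    public TrimSketch(byte[] bytes) { this.bytes = bytes; }

    /** Trims leading and trailing spaces; returns this very instance if nothing was trimmed. */
    public TrimSketch trim() {
        int start = 0;
        int end = bytes.length;
        while (start < end && bytes[start] == ' ') start++;
        while (end > start && bytes[end - 1] == ' ') end--;
        if (start == 0 && end == bytes.length) {
            return this; // nothing to trim: avoid the memory copy entirely
        }
        byte[] copy = new byte[end - start];
        System.arraycopy(bytes, start, copy, 0, end - start);
        return new TrimSketch(copy);
    }
}
```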
[jira] [Created] (FLINK-13053) Vectorization Support in Flink
Liya Fan created FLINK-13053: Summary: Vectorization Support in Flink Key: FLINK-13053 URL: https://issues.apache.org/jira/browse/FLINK-13053 Project: Flink Issue Type: New Feature Components: Table SQL / Runtime Reporter: Liya Fan Assignee: Liya Fan Attachments: image-2019-07-02-15-26-39-550.png Vectorization is a popular technique in SQL engines today. Compared with the traditional row-based approach, it has some distinct advantages, for example: * Better use of CPU resources (e.g. SIMD) * More compact memory layout * More friendly to compressed data formats. Currently, Flink is based on a row-based SQL engine for both stream and batch workloads. To enjoy the above benefits, we want to bring vectorization to Flink. This involves substantial changes to the existing code base, so we plan to carry out the changes in small, incremental steps, in order not to affect existing features. We want to apply it to batch workloads first. The details can be found in our proposal. Over the past months, we have developed an initial implementation of the above ideas. Initial performance evaluations on TPC-H benchmarks show that substantial performance improvements can be obtained by vectorization (see the figure below). More details can be found in our proposal. !image-2019-07-02-15-26-39-550.png! Special thanks to @Kurt Young’s team for all the kind help. Special thanks to @Piotr Nowojski for all the valuable feedback and helpful suggestions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-13043) Fix the bug of parsing Dewey number from string
Liya Fan created FLINK-13043: Summary: Fix the bug of parsing Dewey number from string Key: FLINK-13043 URL: https://issues.apache.org/jira/browse/FLINK-13043 Project: Flink Issue Type: Bug Components: Library / CEP Reporter: Liya Fan Assignee: Liya Fan There is a bug in the current implementation for parsing the Dewey number: String[] splits = deweyNumberString.split("\\."); if (splits.length == 0) { return new DeweyNumber(Integer.parseInt(deweyNumberString)); } The length in the if condition should be 1 instead of 0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
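A quick demonstration of why the comparison must be against 1, not 0: String#split never returns an empty array for a non-empty input, so a plain number with no dots splits into exactly one element (the helper class here is illustrative, not Flink's CEP code).

```java
public class DeweySplit {

    /** Number of dot-separated components in a Dewey number string. */
    public static int parts(String deweyNumberString) {
        // "42".split("\\.") yields ["42"], i.e. length 1 — never length 0
        return deweyNumberString.split("\\.").length;
    }
}
```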
[jira] [Created] (FLINK-12922) Remove method parameter from OperatorCodeGenerator
Liya Fan created FLINK-12922: Summary: Remove method parameter from OperatorCodeGenerator Key: FLINK-12922 URL: https://issues.apache.org/jira/browse/FLINK-12922 Project: Flink Issue Type: Improvement Components: Table SQL / Planner Reporter: Liya Fan Assignee: Liya Fan The TableConfig parameter of OperatorCodeGenerator#generateOneInputStreamOperator should be removed, because: # This parameter is never actually used. # If it is ever needed in the future, we can use ctx.getConfig to get the same object. # The method signatures should be consistent: generateTwoInputStreamOperator does not have this parameter, so it should be removed here as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12900) Refactor the class hierarchy for BinaryFormat
Liya Fan created FLINK-12900: Summary: Refactor the class hierarchy for BinaryFormat Key: FLINK-12900 URL: https://issues.apache.org/jira/browse/FLINK-12900 Project: Flink Issue Type: Improvement Components: Table SQL / Runtime Reporter: Liya Fan Assignee: Liya Fan There are many classes in the class hierarchy of BinaryFormat. They share the same memory format: header + nullable bits + fixed-length part + variable-length part. So many operations can be applied to a number of sub-classes. Currently, many such operations are implemented in each sub-class, although they implement identical functionality. This makes the code hard to understand and maintain. In this proposal, we refactor the class hierarchy and move common operations into the base class, leaving only one implementation for each common operation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12886) Support container memory segment
Liya Fan created FLINK-12886: Summary: Support container memory segment Key: FLINK-12886 URL: https://issues.apache.org/jira/browse/FLINK-12886 Project: Flink Issue Type: New Feature Components: Table SQL / Runtime Reporter: Liya Fan Assignee: Liya Fan Attachments: image-2019-06-18-17-59-42-136.png We observe that in many scenarios, operations/algorithms are based on an array of MemorySegment. These memory segments form a large, combined, and continuous memory space. For example, suppose we have an array of n memory segments. Memory addresses from 0 to segment_size - 1 are served by the first memory segment; memory addresses from segment_size to 2 * segment_size - 1 are served by the second memory segment, and so on. Specific algorithms decide which MemorySegment serves each operation request. In some rare cases, two or more memory segments serve a single request. There are many operations based on such a paradigm, for example, {{BinaryString#matchAt}}, {{SegmentsUtil#copyToBytes}}, {{LongHashPartition#MatchIterator#get}}, etc. The problem is that, for memory-segment-array based operations, a large amount of code is devoted to: 1. Computing the memory segment index & the offset within the memory segment. 2. Processing boundary cases. For example, when writing an integer, there may be only 2 bytes left in the first memory segment, so the remaining 2 bytes must be written to the next memory segment. 3. Differentiated processing for short/long data. For example, when copying memory data to a byte array, different methods are implemented for the cases when 1) the data fits in a single segment; 2) the data spans multiple segments. Therefore, there is much duplicated code to achieve the above purposes. What is worse, this paradigm significantly increases the amount of code, making it more difficult to read and maintain. Furthermore, it easily gives rise to bugs which are difficult to find and debug.
To address these problems, we propose a new type of memory segment: {{ContainerMemorySegment}}. It is based on an array of underlying memory segments of the same size. It extends the {{MemorySegment}} base class, so it provides all the functionality of {{MemorySegment}}. In addition, it hides all the details of dealing with specific memory segments, and acts as if it were one big continuous memory region. A prototype implementation is given below: !image-2019-06-18-17-59-42-136.png|thumbnail! With this new type of memory segment, many operations/algorithms can be greatly simplified, without affecting performance. This is because: 1. Many checks and boundary-processing steps are already there; we just move them to the new class. 2. We optimize the implementation of the new class, so the special optimizations (e.g. optimizations for short data) are still preserved. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
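The index/offset arithmetic that such a container would centralize can be sketched as follows (class and method names are hypothetical; this assumes a power-of-two segment size so the computation is a shift and a mask):

```java
public class SegmentAddressing {
    private final int segmentSizeBits; // segment size = 1 << segmentSizeBits

    public SegmentAddressing(int segmentSizeBits) { this.segmentSizeBits = segmentSizeBits; }

    /** Index of the segment serving the given global position. */
    public int segmentIndex(long position) {
        return (int) (position >>> segmentSizeBits);
    }

    /** Offset within that segment. */
    public int segmentOffset(long position) {
        return (int) (position & ((1L << segmentSizeBits) - 1));
    }

    /** True if an access of numBytes at position crosses a segment boundary (case 2 above). */
    public boolean crossesBoundary(long position, int numBytes) {
        return segmentIndex(position) != segmentIndex(position + numBytes - 1);
    }
}
```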
[jira] [Created] (FLINK-12879) Improve the performance of AbstractBinaryWriter
Liya Fan created FLINK-12879: Summary: Improve the performance of AbstractBinaryWriter Key: FLINK-12879 URL: https://issues.apache.org/jira/browse/FLINK-12879 Project: Flink Issue Type: Improvement Components: Table SQL / Runtime Reporter: Liya Fan Assignee: Liya Fan Improve the performance of AbstractBinaryWriter by: 1. removing unnecessary memory copies; 2. improving the performance of rounding the buffer size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
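On the second point, one common fast way to round a buffer capacity up to the next power of two is a bit trick instead of a loop (a hedged sketch; AbstractBinaryWriter's actual rounding code may differ):

```java
public class RoundCapacity {

    /** Smallest power of two >= n, computed with a single highestOneBit call. */
    public static int nextPowerOfTwo(int n) {
        if (n <= 1) {
            return 1;
        }
        return Integer.highestOneBit(n - 1) << 1;
    }
}
```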
[jira] [Created] (FLINK-12730) Combine BitSet implementations in flink-runtime
Liya Fan created FLINK-12730: Summary: Combine BitSet implementations in flink-runtime Key: FLINK-12730 URL: https://issues.apache.org/jira/browse/FLINK-12730 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Reporter: Liya Fan Assignee: Liya Fan There are two implementations of BitSet in the flink-runtime component: one is org.apache.flink.runtime.operators.util.BloomFilter#BitSet, while the other is org.apache.flink.runtime.operators.util.BitSet. The two classes are quite similar in their APIs and implementations. The only difference is that the former is based on long operations while the latter is based on byte operations. This has the following consequences: # The byte-based BitSet has better performance for get/set operations. # The long-based BitSet has better performance for the clear operation. We combine the two implementations and make the best of both worlds. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
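A simplified sketch of the "best of both worlds" idea (illustrative only, not the actual Flink class): byte-addressed get/set avoids the shifting overhead of wide words, while the bulk clear delegates to Arrays.fill, which is word-optimized internally.

```java
import java.util.Arrays;

public class HybridBitSet {
    private final byte[] bytes;

    public HybridBitSet(int bitCount) {
        this.bytes = new byte[(bitCount + 7) >>> 3];
    }

    /** Byte-based set: one array store, cheap index arithmetic. */
    public void set(int index) {
        bytes[index >>> 3] |= (1 << (index & 7));
    }

    /** Byte-based get. */
    public boolean get(int index) {
        return (bytes[index >>> 3] & (1 << (index & 7))) != 0;
    }

    /** Bulk clear; Arrays.fill works on whole words under the hood. */
    public void clear() {
        Arrays.fill(bytes, (byte) 0);
    }
}
```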
[jira] [Created] (FLINK-12687) ByteHashSet is always in dense mode
Liya Fan created FLINK-12687: Summary: ByteHashSet is always in dense mode Key: FLINK-12687 URL: https://issues.apache.org/jira/browse/FLINK-12687 Project: Flink Issue Type: Improvement Components: Runtime / Operators Reporter: Liya Fan Assignee: Liya Fan Since there are only 256 possible byte values, the largest possible range is 255, so the condition range < OptimizableHashSet.DENSE_THRESHOLD is always satisfied. Therefore, ByteHashSet must always be in dense mode. We can make use of this to improve the performance and code structure. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
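What "always dense" buys can be sketched like this (an illustrative class, not Flink's ByteHashSet): with only 256 possible values, membership reduces to a direct array lookup, with no hashing, probing, or collision handling at all.

```java
public class DenseByteSet {
    // one slot per possible byte value, indexed by the unsigned value 0..255
    private final boolean[] used = new boolean[256];

    public void add(byte value) {
        used[value & 0xFF] = true;
    }

    public boolean contains(byte value) {
        return used[value & 0xFF];
    }
}
```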
[jira] [Created] (FLINK-12593) Revise the document for CEP
Liya Fan created FLINK-12593: Summary: Revise the document for CEP Key: FLINK-12593 URL: https://issues.apache.org/jira/browse/FLINK-12593 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Liya Fan The document for CEP (flink/docs/dev/libs/cep.md) can be difficult to understand and follow, especially for beginners. I suggest revising it in the following aspects: 1. Give more detailed descriptions of the existing examples. 2. Add more examples to illustrate the features. 3. Add more explanations for some concepts, like contiguity. 4. Add more references to help readers better understand the concepts. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12553) Fix a bug in SqlDateTimeUtils#parseToTimeMillis
Liya Fan created FLINK-12553: Summary: Fix a bug in SqlDateTimeUtils#parseToTimeMillis Key: FLINK-12553 URL: https://issues.apache.org/jira/browse/FLINK-12553 Project: Flink Issue Type: Bug Components: Table SQL / Runtime Reporter: Liya Fan Assignee: Liya Fan If the parameter "1999-12-31 12:34:56.123" is used, the method should return 123, but it currently returns 230. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
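A hedged sketch of correct millisecond extraction from a fractional-seconds field (illustrative only; SqlDateTimeUtils' real parsing is more involved): the fraction must be scaled by its digit count, so ".123" is 123 ms and ".1" is 100 ms, never a digit-shuffled value.

```java
public class FractionMillis {

    /** Converts a fractional-seconds digit string ("123", "1", "1234") to milliseconds. */
    public static int toMillis(String fraction) {
        // keep at most 3 digits (sub-millisecond precision is truncated)
        String digits = fraction.length() > 3 ? fraction.substring(0, 3) : fraction;
        int value = Integer.parseInt(digits);
        // scale up if fewer than 3 digits were given: "1" -> 100, "12" -> 120
        for (int i = digits.length(); i < 3; i++) {
            value *= 10;
        }
        return value;
    }
}
```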
[jira] [Created] (FLINK-12361) Remove useless expression from runtime scheduler
Liya Fan created FLINK-12361: Summary: Remove useless expression from runtime scheduler Key: FLINK-12361 URL: https://issues.apache.org/jira/browse/FLINK-12361 Project: Flink Issue Type: Improvement Components: Runtime / Operators Reporter: Liya Fan Assignee: Liya Fan Attachments: image-2019-04-29-11-16-13-492.png In the scheduleTask method of the Scheduler class, the expression forceExternalLocation is useless, since it always evaluates to false: !image-2019-04-29-11-16-13-492.png! So it can be removed. Moreover, by removing this expression, the code structure can be made much simpler, because some branches rely on this expression and can also be removed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12335) Improve the performance of class SegmentsUtil
Liya Fan created FLINK-12335: Summary: Improve the performance of class SegmentsUtil Key: FLINK-12335 URL: https://issues.apache.org/jira/browse/FLINK-12335 Project: Flink Issue Type: Improvement Components: Table SQL / Runtime Reporter: Liya Fan Assignee: Liya Fan Improve the performance of class SegmentsUtil in two ways: 1. In method allocateReuseBytes, the generated byte array should be cached for reuse if its size does not exceed MAX_BYTES_LENGTH. However, the array is not cached when bytes.length < length, and this leads to performance overhead: if (bytes == null) { if (length <= MAX_BYTES_LENGTH) { bytes = new byte[MAX_BYTES_LENGTH]; BYTES_LOCAL.set(bytes); } else { bytes = new byte[length]; } } else if (bytes.length < length) { bytes = new byte[length]; } 2. To evaluate the offset, an integer is bit-anded with a mask to clear the low bits, and then shifted right. The bit-and is useless: ((index & BIT_BYTE_POSITION_MASK) >>> 3) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
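The second point can be checked quickly: masking off the low 3 bits before an unsigned right shift by 3 is a no-op, because the shift discards exactly those bits anyway (the mask constant below is an assumption matching the description, i.e. it clears the low 3 bits).

```java
public class MaskRedundancy {
    // assumed value: a mask that clears the low 3 bits, as described in the issue
    private static final int BIT_BYTE_POSITION_MASK = ~7;

    /** The masked-then-shifted form equals the plain shift for every int. */
    public static boolean sameResult(int index) {
        return ((index & BIT_BYTE_POSITION_MASK) >>> 3) == (index >>> 3);
    }
}
```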
[jira] [Created] (FLINK-12290) Fix the misleading exception message in SharedSlot class
Liya Fan created FLINK-12290: Summary: Fix the misleading exception message in SharedSlot class Key: FLINK-12290 URL: https://issues.apache.org/jira/browse/FLINK-12290 Project: Flink Issue Type: Bug Reporter: Liya Fan Assignee: Liya Fan The exception message in SharedSlot.releaseSlot is misleading. It says "SharedSlot is not empty and released". But the condition should be "SharedSlot is not released and empty." -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12289) Fix bugs and typos in Memory manager
Liya Fan created FLINK-12289: Summary: Fix bugs and typos in Memory manager Key: FLINK-12289 URL: https://issues.apache.org/jira/browse/FLINK-12289 Project: Flink Issue Type: Bug Components: Runtime / Operators Reporter: Liya Fan Assignee: Liya Fan According to the JavaDoc, the MemoryManager.release method should throw an NPE if the input argument is null, but it currently does not. In addition, there are some typos in class MemoryManager. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12223) HeapMemorySegment.getArray should return null after being freed
Liya Fan created FLINK-12223: Summary: HeapMemorySegment.getArray should return null after being freed Key: FLINK-12223 URL: https://issues.apache.org/jira/browse/FLINK-12223 Project: Flink Issue Type: Bug Reporter: Liya Fan Assignee: Liya Fan According to the JavaDoc, HeapMemorySegment.getArray should return null after the segment has been freed, but it does not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12216) Respect the number of bytes from input parameters in HybridMemorySegment
Liya Fan created FLINK-12216: Summary: Respect the number of bytes from input parameters in HybridMemorySegment Key: FLINK-12216 URL: https://issues.apache.org/jira/browse/FLINK-12216 Project: Flink Issue Type: Bug Reporter: Liya Fan Assignee: Liya Fan For the following two methods in the HybridMemorySegment class, public final void get(int offset, ByteBuffer target, int numBytes) public final void put(int offset, ByteBuffer source, int numBytes) the actual number of bytes read/written should be specified by the input parameter numBytes, but this is not respected for some types of ByteBuffer; instead, the methods simply read/write until the end. So this is a bug and I am going to fix it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
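The intended contract can be sketched as follows (illustrative, not HybridMemorySegment's actual code): the copy is bounded by the numBytes parameter, never by the remaining length of the source or target.

```java
import java.nio.ByteBuffer;

public class BoundedCopy {

    /** Copies exactly numBytes from source[offset..] into target — no more, no less. */
    public static void get(byte[] source, int offset, ByteBuffer target, int numBytes) {
        if (offset < 0 || numBytes < 0 || offset + numBytes > source.length) {
            throw new IndexOutOfBoundsException();
        }
        // ByteBuffer.put(byte[], int, int) transfers exactly numBytes bytes
        target.put(source, offset, numBytes);
    }
}
```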
[jira] [Created] (FLINK-12210) Fix a bug in AbstractPagedInputView.readLine
Liya Fan created FLINK-12210: Summary: Fix a bug in AbstractPagedInputView.readLine Key: FLINK-12210 URL: https://issues.apache.org/jira/browse/FLINK-12210 Project: Flink Issue Type: Bug Components: Runtime / Operators Reporter: Liya Fan Assignee: Liya Fan In AbstractPagedInputView.readLine, the character '\r' is removed in two places, which is redundant. In addition, only a trailing '\r' should be removed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
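The intended behavior can be shown with a small sketch (illustrative; the real method works on an input view, not a String): strip a single trailing '\r' left over from a "\r\n" line ending, and leave any '\r' in the middle of the line untouched.

```java
public class StripCr {

    /** Removes one trailing '\r' if present; interior '\r' characters are preserved. */
    public static String stripTrailingCr(String line) {
        if (line.endsWith("\r")) {
            return line.substring(0, line.length() - 1);
        }
        return line;
    }
}
```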
[jira] [Created] (FLINK-12189) Fix bugs and typos in memory management related classes
Liya Fan created FLINK-12189: Summary: Fix bugs and typos in memory management related classes Key: FLINK-12189 URL: https://issues.apache.org/jira/browse/FLINK-12189 Project: Flink Issue Type: Bug Components: Runtime / Operators Reporter: Liya Fan Assignee: Liya Fan There are some bugs and typos in the memory-related source code. For example, # In the MemoryManager.release method, it should throw an NPE if the input is null, but it does not. # In the HybridMemorySegment.put and get methods, the number of bytes read/written should be specified by the input parameter, rather than reading/writing until the end. # In HeapMemorySegment, the data member memory should be returned, because the member heapArray is not reset to null after calling the release method. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12162) Build error in flink-table-planner
Liya Fan created FLINK-12162: Summary: Build error in flink-table-planner Key: FLINK-12162 URL: https://issues.apache.org/jira/browse/FLINK-12162 Project: Flink Issue Type: Bug Components: Table SQL / Runtime Reporter: Liya Fan Assignee: Liya Fan There is a build error in project flink-table-planner: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile (default-compile) on project flink-table-planner_2.11: Compilation failure [ERROR] .../flink-table-planner/src/main/java/org/apache/flink/table/operations/ProjectionOperationFactory.java:[85,54] unreported exception X; must be caught or declared to be thrown I am using JDK 1.8.0_45, maven 3.5.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11421) Providing more compilation options for code-generated operators
Liya Fan created FLINK-11421: Summary: Providing more compilation options for code-generated operators Key: FLINK-11421 URL: https://issues.apache.org/jira/browse/FLINK-11421 Project: Flink Issue Type: New Feature Components: Core Reporter: Liya Fan Assignee: Liya Fan Flink supports some operators (like Calc, Hash Agg, Hash Join, etc.) by code generation. That is, Flink generates their source code dynamically, and then compiles it into Java byte code, which is loaded and executed at runtime. By default, Flink compiles the generated source code with Janino. This is fast, as the compilation often finishes in hundreds of milliseconds. The generated Java byte code, however, is of poor quality. To illustrate, we use the Java Compiler API (JCA) to compile the generated code. Experiments on TPC-H (1 TB) queries show that the E2E time can be more than 10% shorter when operators are compiled by JCA, even though it takes more time (a few seconds) to compile with JCA. Therefore, we believe it is beneficial to compile generated code with JCA in the following scenarios: 1) For batch jobs, the E2E time is relatively long, so it is worth spending more time compiling to generate high-quality Java byte code. 2) For repeated stream jobs, the generated code is compiled once and run many times, so it pays to spend more time compiling the first time and enjoy the high byte code quality in later runs. Based on the above observations, we want to provide a compilation option (Janino, JCA, or dynamic) in Flink, so that users can choose the one suitable for their specific scenario and obtain better performance whenever possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005)