[jira] [Created] (FLINK-15538) Separate decimal implementations into separate sub-classes

2020-01-09 Thread Liya Fan (Jira)
Liya Fan created FLINK-15538:


 Summary: Separate decimal implementations into separate sub-classes
 Key: FLINK-15538
 URL: https://issues.apache.org/jira/browse/FLINK-15538
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Runtime
Reporter: Liya Fan


The current implementation of Decimal values have two (somewhat independent) 
implementations: one is based on Long, while the other is based on BigDecimal. 

This makes the Decmial class not clear (both implementations cluttered in a 
single class) and less efficient (each method involves a if-else branch). 

So in this issue, we make Decimal an abstract class, and separate the two 
implementation into two sub-classes. This makes the code clearer and more 
efficient. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14731) LogicalWatermarkAssigner should use specified trait set when doing copy

2019-11-12 Thread Liya Fan (Jira)
Liya Fan created FLINK-14731:


 Summary: LogicalWatermarkAssigner should use specified trait set 
when doing copy
 Key: FLINK-14731
 URL: https://issues.apache.org/jira/browse/FLINK-14731
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Reporter: Liya Fan


In LogicalWatermarkAssigner#copy method, creating the new 
LogicalWatermarkAssigner object should use the trait set from the input 
parameter, instead of the trait set of the current object. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-13200) Improve the generated code for if statements

2019-07-10 Thread Liya Fan (JIRA)
Liya Fan created FLINK-13200:


 Summary: Improve the generated code for if statements
 Key: FLINK-13200
 URL: https://issues.apache.org/jira/browse/FLINK-13200
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Planner
Reporter: Liya Fan
Assignee: Liya Fan


In the generated code, we often code snippet like this:

 

if (true) {
  acc$6.setNullAt(1);
} else {
  acc$6.setField(1, ((int) -1));;
}

Such code impacts the code readability, and increases the code size, making it 
more costly for compiling and transferring through network.

 

In this issue, we remove such useless if conditions. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13108) Remove duplicated type cast in generated code

2019-07-04 Thread Liya Fan (JIRA)
Liya Fan created FLINK-13108:


 Summary: Remove duplicated type cast in generated code
 Key: FLINK-13108
 URL: https://issues.apache.org/jira/browse/FLINK-13108
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Reporter: Liya Fan
Assignee: Liya Fan


There are duplicated cast operations in the generated code. For example, to run 
org.apache.flink.table.runtime.batch.sql.join.JoinITCase#testJoin, the 
generated code looks like this:

@Override
public void 
processElement(org.apache.flink.streaming.runtime.streamrecord.StreamRecord 
element) throws Exception {
  org.apache.flink.table.dataformat.BaseRow in1 = 
(org.apache.flink.table.dataformat.BaseRow) 
(org.apache.flink.table.dataformat.BaseRow) 
converter$0.toInternal((org.apache.flink.types.Row) element.getValue());

This issue remove the duplicated type cast.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13058) Avoid memory copy for the trimming operations of BinaryString

2019-07-02 Thread Liya Fan (JIRA)
Liya Fan created FLINK-13058:


 Summary: Avoid memory copy for the trimming operations of 
BinaryString
 Key: FLINK-13058
 URL: https://issues.apache.org/jira/browse/FLINK-13058
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Runtime
Reporter: Liya Fan
Assignee: Liya Fan


For trimming operations of BinaryString (trim, trimLeft, trimRight), if the 
trimmed string is identical to the original string. The memory copy can be 
avoided by directly returning the original string. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13053) Vectorization Support in Flink

2019-07-02 Thread Liya Fan (JIRA)
Liya Fan created FLINK-13053:


 Summary: Vectorization Support in Flink
 Key: FLINK-13053
 URL: https://issues.apache.org/jira/browse/FLINK-13053
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / Runtime
Reporter: Liya Fan
Assignee: Liya Fan
 Attachments: image-2019-07-02-15-26-39-550.png

Vectorization is a popular technique in SQL engines today. Compared with 
traditional row-based approach, it has some distinct advantages, for example:

 
 * Better use of CPU resources (e.g. SIMD)
 * More compact memory layout
 * More friendly to compressed data format.

 

Currently, Flink is based on a row-based SQL engine for both stream and batch 
workloads. To enjoy the above benefits, we want to bring vectorization to 
Flink. This involves substantial changes to the existing code base. Therefore, 
we give a plan to carry out such changes in small, incremental steps, in order 
not to affect existing features. We want to apply it to batch workload first. 
The details can be found in our proposal.

 

For the past months, we have developed an initial implementation of the above 
ideas. Initial performance evaluations on TPC-H benchmarks show that 
substantial performance improvements can be obtained by vectorization (see the 
figure below). More details can be found in our proposal.

  !image-2019-07-02-15-26-39-550.png!

Special thanks to @Kurt Young’s team for all the kind help.

Special thanks to @Piotr Nowojski for all the valuable feedback and help 
suggestions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13043) Fix the bug of parsing Dewey number from string

2019-07-01 Thread Liya Fan (JIRA)
Liya Fan created FLINK-13043:


 Summary: Fix the bug of parsing Dewey number from string
 Key: FLINK-13043
 URL: https://issues.apache.org/jira/browse/FLINK-13043
 Project: Flink
  Issue Type: Bug
  Components: Library / CEP
Reporter: Liya Fan
Assignee: Liya Fan


There is a bug in the current implementation for parsing the Dewey number:

 

String[] splits = deweyNumberString.split("\\.");

if (splits.length == 0) {
 return new DeweyNumber(Integer.parseInt(deweyNumberString));
 }

 

The length in the if condition should be 1 instead of 0.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12922) Remove method parameter from OperatorCodeGenerator

2019-06-21 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12922:


 Summary: Remove method parameter from OperatorCodeGenerator
 Key: FLINK-12922
 URL: https://issues.apache.org/jira/browse/FLINK-12922
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Planner
Reporter: Liya Fan
Assignee: Liya Fan


The TableConfig parameter of 
OperatorCodGenerator#generateOneInputStreamOperator should be removed, because:
 # This parameter is never actually used.
 # If it is ever used in the future, we can use ctx.getConfig to get the same 
object
 # The method signature should be consistent. The method 
generateTwoInputStreamOperator does not have this parameter. So this parameter 
should also be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12900) Refactor the class hierarchy for BinaryFormat

2019-06-19 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12900:


 Summary: Refactor the class hierarchy for BinaryFormat
 Key: FLINK-12900
 URL: https://issues.apache.org/jira/browse/FLINK-12900
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Runtime
Reporter: Liya Fan
Assignee: Liya Fan


The there are many classes in the class hierarchy of BinaryFormat. They share 
the same memory format:

header + nullable bits + fixed length part + variable length part

So many operations can be applied to a number of sub-classes. Currently, many 
such operations are implemented in each sub-class, although they implement 
identical functionality. 

This makes the code hard to understand and maintain.

In this proposal, we refactor the class hierarchy, and move common operations 
into the base class, leaving only one implementation for each common operation. 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12886) Support container memory segment

2019-06-18 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12886:


 Summary: Support container memory segment
 Key: FLINK-12886
 URL: https://issues.apache.org/jira/browse/FLINK-12886
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / Runtime
Reporter: Liya Fan
Assignee: Liya Fan
 Attachments: image-2019-06-18-17-59-42-136.png

We observe that in many scenarios, the operations/algorithms are based on an 
array of MemorySegment. These memory segments form a large, combined, and 
continuous memory space.

For example, suppose we have an array of n memory segments. Memory addresses 
from 0 to segment_size - 1 are served by the first memory segment; memory 
addresses from segment_size to 2 * segment_size - 1 are served by the second 
memory segment, and so on.

Specific algorithms decide the actual MemorySegment to serve the operation 
requests. For some rare cases, two or more memory segments serve the requests. 
There are many operations based on such a paradigm, for example, 
{{BinaryString#matchAt}}, {{SegmentsUtil#copyToBytes}}, 
{{LongHashPartition#MatchIterator#get}}, etc.

The problem is that, for memory segment array based operations, large amounts 
of code is devoted to

1. Computing the memory segment index & offset within the memory segment.
 2. Processing boundary cases. For example, to write an integer, there are only 
2 bytes left in the first memory segment, and the remaining 2 bytes must be 
written to the next memory segment.
 3. Differentiate processing for short/long data. For example, when copying 
memory data to a byte array. Different methods are implemented for cases when 
1) the data fits in a single segment; 2) the data spans multiple segments.

Therefore, there are much duplicated code to achieve above purposes. What is 
worse, this paradigm significantly increases the amount of code, making the 
code more difficult to read and maintain. Furthermore, it easily gives rise to 
bugs which difficult to find and debug.

To address these problems, we propose a new type of memory segment: 
{{ContainerMemorySegment}}. It is based on an array of underlying memory 
segments with the same size. It extends from the {{MemorySegment}} base class, 
so it provides all the functionalities provided by {{MemorySegment}}. In 
addition, it hides all the details for dealing with specific memory segments, 
and acts as if it were a big continuous memory region.

A prototype implementation is given below:

 !image-2019-06-18-17-59-42-136.png|thumbnail! 

With this new type of memory segment, many operations/algorithms can be greatly 
simplified, without affecting performance. This is because,

1. Many checks, boundary processing are already there. We just move them to the 
new class.
 2. We optimize the implementation of the new class, so the special 
optimizations (e.g. optimizations for short data) are still preserved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12879) Improve the performance of AbstractBinaryWriter

2019-06-17 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12879:


 Summary: Improve the performance of AbstractBinaryWriter
 Key: FLINK-12879
 URL: https://issues.apache.org/jira/browse/FLINK-12879
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Runtime
Reporter: Liya Fan
Assignee: Liya Fan


Improve the performance of AbstractBinaryWriter by:
1. remove unnecessary memory copy
2. improve the performance of rounding buffer size.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12730) Combine BitSet implementations in flink-runtime

2019-06-04 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12730:


 Summary: Combine BitSet implementations in flink-runtime
 Key: FLINK-12730
 URL: https://issues.apache.org/jira/browse/FLINK-12730
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Reporter: Liya Fan
Assignee: Liya Fan


There are two implementations for BitSet in flink-runtime ocmponent: one is 
org.apache.flink.runtime.operators.util.BloomFilter#BitSet, while the other is 
org.apache.flink.runtime.operators.util.BitSet

The two classes are quite similar in their API and implementations. The only 
difference is that, the former is based based on long operation while the 
latter is based on byte operation. This has the following consequence:
 # The byte based BitSet has better performance for get/set operations.
 # The long based BitSet has better performance for the clear operation.

We combine the two implementations and make the best of both worlds.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12687) ByteHashSet is always in dense mode

2019-05-30 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12687:


 Summary: ByteHashSet is always in dense mode
 Key: FLINK-12687
 URL: https://issues.apache.org/jira/browse/FLINK-12687
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Operators
Reporter: Liya Fan
Assignee: Liya Fan


Since there are only 256 possible byte values, the largest possible range is 
255, and the condition 

range < OptimizableHashSet.DENSE_THRESHOLD

must be satisfied. So ByteHashSet must be in dense mode.

We can make use of this to improve the performance and code structure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12593) Revise the document for CEP

2019-05-22 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12593:


 Summary: Revise the document for CEP
 Key: FLINK-12593
 URL: https://issues.apache.org/jira/browse/FLINK-12593
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Reporter: Liya Fan


The document for CEP (flink/docs/dev/libs/cep.md) can be difficult to 
understand and follow, especially for beginners.

I suggest revising from the following aspects:

1. Give more detailed descriptions of existing examples.

2. More examples are required to illustrate the features.

3. More explanations are required for some concepts, like contiguity.

4. We can add more references to better understand the concepts. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12553) Fix a bug in SqlDateTimeUtils#parseToTimeMillis

2019-05-20 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12553:


 Summary: Fix a bug in SqlDateTimeUtils#parseToTimeMillis
 Key: FLINK-12553
 URL: https://issues.apache.org/jira/browse/FLINK-12553
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Runtime
Reporter: Liya Fan
Assignee: Liya Fan


If parameter "1999-12-31 12:34:56.123" is used, it should return 123. But it 
returns 230 now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12361) Remove useless expression from runtime scheduler

2019-04-28 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12361:


 Summary: Remove useless expression from runtime scheduler
 Key: FLINK-12361
 URL: https://issues.apache.org/jira/browse/FLINK-12361
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Operators
Reporter: Liya Fan
Assignee: Liya Fan
 Attachments: image-2019-04-29-11-16-13-492.png

In the scheduleTask method of Scheduler class, expression forceExternalLocation 
is useless, since it always evaluates to false:

 !image-2019-04-29-11-16-13-492.png! 

So it can be removed. Moreover, by removing this expression, the code structure 
can be made much simpler, because there are some branches relying this 
expression, which can also be removed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12335) Improvement the performance of class SegmentsUtil

2019-04-26 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12335:


 Summary: Improvement the performance of class SegmentsUtil
 Key: FLINK-12335
 URL: https://issues.apache.org/jira/browse/FLINK-12335
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Runtime
Reporter: Liya Fan
Assignee: Liya Fan


Improve the performance of class SegmentsUtil from two points:

In method allocateReuseBytes, the generated byte array should be cached for 
reuse, if the size does not exceed MAX_BYTES_LENGTH. However, the array is not 
cached if bytes.length < length, and this will lead to performance overhead:

if (bytes == null) {
  if (length <= MAX_BYTES_LENGTH) {
    bytes = new byte[MAX_BYTES_LENGTH]; BYTES_LOCAL.set(bytes); 
  }
  else {
    bytes = new byte[length]; 
  }
} else if (bytes.length < length) {
  bytes = new byte[length]; 
}

 

2. To evaluate the offset, an integer is bitand with a mask to clear to low 
bits, and then shift right. The bitand is useless:

 

((index & BIT_BYTE_POSITION_MASK) >>> 3)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12290) Fix the misleading exception message in SharedSlot class

2019-04-22 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12290:


 Summary: Fix the misleading exception message in SharedSlot class
 Key: FLINK-12290
 URL: https://issues.apache.org/jira/browse/FLINK-12290
 Project: Flink
  Issue Type: Bug
Reporter: Liya Fan
Assignee: Liya Fan


The exception message in SharedSlot.releaseSlot is misleading. It says 
"SharedSlot is not empty and released". But the condition should be "SharedSlot 
is not released and empty."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12289) Fix bugs and typos in Memory manager

2019-04-22 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12289:


 Summary: Fix bugs and typos in Memory manager
 Key: FLINK-12289
 URL: https://issues.apache.org/jira/browse/FLINK-12289
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Operators
Reporter: Liya Fan
Assignee: Liya Fan


According to the JavaDoc,  MemoryManager.release method should throw an NPE if 
the input argument is null. 

In addition, there are some typos in class MemoryManager.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12223) HeapMemorySegment.getArray should return null after being freed

2019-04-17 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12223:


 Summary: HeapMemorySegment.getArray should return null after being 
freed
 Key: FLINK-12223
 URL: https://issues.apache.org/jira/browse/FLINK-12223
 Project: Flink
  Issue Type: Bug
Reporter: Liya Fan
Assignee: Liya Fan


According to the JavaDoc, HeapMemorySegment.getArray should return null, but it 
does not. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12216) Respect the number of bytes from input parameters in HybridMemorySegment

2019-04-16 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12216:


 Summary: Respect the number of bytes from input parameters in 
HybridMemorySegment
 Key: FLINK-12216
 URL: https://issues.apache.org/jira/browse/FLINK-12216
 Project: Flink
  Issue Type: Bug
Reporter: Liya Fan
Assignee: Liya Fan


For the following two methods in HybridMemorySegment class,

public final void get(int offset, ByteBuffer target, int numBytes)

public final void put(int offset, ByteBuffer source, int numBytes) 

the actual number of bytes read/written should be specified by the input 
parameter numBytes, but it does not for some types of ByteBuffer. Instead, it 
simply read/write until the end.

So this is a bug and I am going to fix it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12210) Fix a bug in AbstractPagedInputView.readLine

2019-04-16 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12210:


 Summary: Fix a bug in AbstractPagedInputView.readLine
 Key: FLINK-12210
 URL: https://issues.apache.org/jira/browse/FLINK-12210
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Operators
Reporter: Liya Fan
Assignee: Liya Fan


In AbstractPagedInputView.readLine, character '\r' is removed in two places, 
which is redundant. In addition, only trailing '\r' should be removed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12189) Fix bugs and typos in memory management related classes

2019-04-15 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12189:


 Summary: Fix bugs and typos in memory management related classes
 Key: FLINK-12189
 URL: https://issues.apache.org/jira/browse/FLINK-12189
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Operators
Reporter: Liya Fan
Assignee: Liya Fan


There are some bugs and typos in the memory related source code. For example,
 # In MemoryManager.release method, it should throw an NPE, if the input is 
null. But it does not. 
 # In HybridMemorySegment.put and get methods, the number of bytes read/written 
should be specified by the input parameter, rather than reading/writing until 
the end.
 # In HeapMemorySegment, the data member memory should be returned. This is 
because the member heapArray will not be reset to null after calling the 
release method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12162) Build error in flink-table-planner

2019-04-11 Thread Liya Fan (JIRA)
Liya Fan created FLINK-12162:


 Summary: Build error in flink-table-planner
 Key: FLINK-12162
 URL: https://issues.apache.org/jira/browse/FLINK-12162
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Runtime
Reporter: Liya Fan
Assignee: Liya Fan


There is a build error in project flink-table-planner:

[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile (default-compile) 
on project flink-table-planner_2.11: Compilation failure
[ERROR] 
.../flink-table-planner/src/main/java/org/apache/flink/table/operations/ProjectionOperationFactory.java:[85,54]
 unreported exception X; must be caught or declared to be thrown

I am using JDK 1.8.0_45, maven 3.5.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11421) Providing more compilation options for code-generated operators

2019-01-23 Thread Liya Fan (JIRA)
Liya Fan created FLINK-11421:


 Summary: Providing more compilation options for code-generated 
operators
 Key: FLINK-11421
 URL: https://issues.apache.org/jira/browse/FLINK-11421
 Project: Flink
  Issue Type: New Feature
  Components: Core
Reporter: Liya Fan
Assignee: Liya Fan


Flink supports some operators (like Calc, Hash Agg, Hash Join, etc.) by code 
generation. That is, Flink generates their source code dynamically, and then 
compile it into Java Byte Code, which is load and executed at runtime.

 

By default, Flink compiles the generated source code by Janino. This is fast, 
as the compilation often finishes in hundreds of milliseconds. The generated 
Java Byte Code, however, is of poor quality. To illustrate, we use Java 
Compiler API (JCA) to compile the generated code. Experiments on TPC-H (1 TB) 
queries show that the E2E time can be more than 10% shorter, when operators are 
compiled by JCA, despite that it takes more time (a few seconds) to compile 
with JCA.

 

Therefore, we believe it is beneficial to compile generated code by JCA in the 
following scenarios: 1) For batch jobs, the E2E time is relatively long, so it 
is worth of spending more time compiling and generating high quality Java Byte 
Code. 2) For repeated stream jobs, the generated code will be compiled once and 
run many times. Therefore, it pays to spend more time compiling for the first 
time, and enjoy the high byte code qualities for later runs.

 

According to the above observations, we want to provide a compilation option 
(Janino, JCA, or dynamic) for Flink, so that the user can choose the one 
suitable for their specific scenario and obtain better performance whenever 
possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)