Hi,

1. `PublicBAOS` extends `ByteArrayOutputStream`, so it can grow automatically. As we do not define the initial size, it defaults to 32 bytes. I think choosing a smarter initial size is a good idea, and the ideal value is the size the page actually reaches just before it has to be flushed.
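
To make the cost of the 32-byte default concrete, here is a minimal sketch that counts how many times a buffer is reallocated, and how many bytes are copied, before it can hold a 64K page. It assumes the capacity simply doubles on every resize, as in the `grow` method quoted further down; the class `GrowthCost` and its methods are illustrative names and are not part of IoTDB or the JDK.

```java
final class GrowthCost {

    /** Number of reallocations before a doubling buffer can hold targetBytes. */
    static int resizeCount(int initialCapacity, int targetBytes) {
        if (initialCapacity <= 0) {
            throw new IllegalArgumentException("initialCapacity must be positive");
        }
        long capacity = initialCapacity; // long avoids overflow while doubling
        int resizes = 0;
        while (capacity < targetBytes) {
            capacity <<= 1; // same doubling rule as the quoted grow()
            resizes++;
        }
        return resizes;
    }

    /** Total bytes copied by those reallocations (each one copies the old array). */
    static long bytesCopied(int initialCapacity, int targetBytes) {
        if (initialCapacity <= 0) {
            throw new IllegalArgumentException("initialCapacity must be positive");
        }
        long capacity = initialCapacity;
        long copied = 0;
        while (capacity < targetBytes) {
            copied += capacity; // Arrays.copyOf copies the whole old buffer
            capacity <<= 1;
        }
        return copied;
    }

    public static void main(String[] args) {
        int page = 64 * 1024; // the 64K page size mentioned in this thread
        for (int init : new int[] {32, 1024, 16 * 1024, page}) {
            System.out.printf("init=%d resizes=%d copied=%d%n",
                    init, resizeCount(init, page), bytesCopied(init, page));
        }
    }
}
```

Under this doubling assumption, starting from 32 bytes takes 11 resizes and copies 65504 bytes in total to reach 64K, and this happens for both `timeOut` and `valueOut`. The copying itself is modest; the trade-off is against the memory reserved up front for every active page.
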
Which factors impact the page size:

(1) The max number of points ($n$) in a page: the size = the data type size * $n$.
(2) The max page size $P$.
(3) The memtable size $M$ and the number of active Chunks $c$ in the memtable: $M$/$c$.

The real scenarios may be more complicated. If we cannot find an intelligent way, the least we can do is expose the initial size as an advanced parameter for DBAs to tune.

For more info, please read the function `void checkPageSizeAndMayOpenANewPage()` in `ChunkWriterImpl`.

If you have more questions, do not hesitate to send an email.

Best,
----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

黄向东
清华大学 软件学院

atoiLiu <[email protected]> wrote on Thu, Dec 12, 2019 at 6:06 PM:

> Hi,
> I understand the process of writing TsFile, but there is a question that
> is not very clear to me. I hope someone can give me some advice.
> TsFile has the concept of Page, which consists of two pieces of data that
> grow on each other,
> 1. timeOut
> 2. valueOut
> Both are cached by the PublicBAOS class, where I notice it extends
> ByteArrayOutputStream and doesn't initialize the capacity when used.
>
> private PageWriter(Encoder timeEncoder, Encoder valueEncoder) {
>   this.timeOut = new PublicBAOS();
>   this.valueOut = new PublicBAOS();
>   this.timeEncoder = timeEncoder;
>   this.valueEncoder = valueEncoder;
> }
>
> public PublicBAOS() {
>   super();
> }
>
> public ByteArrayOutputStream() {
>   this(32);
> }
>
> I noticed that we had a page size that was about 64K in the design
> expectation,
> and this will make the cache constantly grow and need to copy the data
> again,
> I think this is a waste, so I want to add an initial value to it, so how
> much is appropriate?
>
> private void grow(int minCapacity) {
>   // overflow-conscious code
>   int oldCapacity = buf.length;
>   int newCapacity = oldCapacity << 1;
>   if (newCapacity - minCapacity < 0)
>     newCapacity = minCapacity;
>   if (newCapacity - MAX_ARRAY_SIZE > 0)
>     newCapacity = hugeCapacity(minCapacity);
>   buf = Arrays.copyOf(buf, newCapacity);
> }
>
> In the implementation of ByteArrayOutputStream, the default is to double
> the extension.
> In the write flow of page, the default is to write first and then check if
> the data is larger than 64K, which may make the data larger than 64K.
> In this case it would be wrong to set 64K, which would waste more
> resources
> and I think the initial value should be less than 64K, because it might be
> OOM when the time series is very large,
> So I don't really know how much to set
>
> I don't know whether I am correct in thinking this way. I am looking
> forward to your reply
> thanks again
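
P.S. As a rough illustration of the three factors listed in the reply above ($n$ points per page, max page size $P$, and memtable share $M$/$c$), the sketch below takes the smallest of the three as an upper bound for the initial buffer size. Combining them with `min` is only an assumption for illustration, not something the thread or IoTDB prescribes. The class `InitSizeEstimate`, its parameter names, and the numbers in `main` are hypothetical.

```java
final class InitSizeEstimate {

    /** ByteArrayOutputStream's own default, used here as a lower floor. */
    private static final int JDK_DEFAULT = 32;

    /**
     * Rough initial capacity for one page buffer.
     *
     * @param dataTypeSize     bytes per point (e.g. 8 for a long or double)
     * @param maxPointsPerPage n, the max number of points in a page
     * @param maxPageSize      P, the max page size in bytes
     * @param memtableSize     M, the memtable size in bytes
     * @param activeChunks     c, the number of active chunks in the memtable
     */
    static int estimate(int dataTypeSize, int maxPointsPerPage, int maxPageSize,
                        long memtableSize, int activeChunks) {
        long byPoints = (long) dataTypeSize * maxPointsPerPage; // factor (1)
        long byMemtable = memtableSize / Math.max(1, activeChunks); // factor (3)
        // Assumption: the tightest of the three limits bounds the page size.
        long bound = Math.min(byPoints, Math.min((long) maxPageSize, byMemtable));
        return (int) Math.max(JDK_DEFAULT, bound);
    }

    public static void main(String[] args) {
        // Hypothetical workload: many series share the memtable, so M/c dominates.
        System.out.println(estimate(8, 1_000_000, 64 * 1024, 128L * 1024 * 1024, 10_000));
        // Hypothetical workload: few points per page, so n * size dominates.
        System.out.println(estimate(8, 100, 64 * 1024, 128L * 1024 * 1024, 10));
    }
}
```

With many active series the $M$/$c$ term shrinks the estimate well below 64K, which matches the OOM concern raised in the original question. Encoding and compression are ignored here, so the real page size may differ.
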
