markap14 commented on a change in pull request #5011:
URL: https://github.com/apache/nifi/pull/5011#discussion_r616704347
##########
File path:
nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/schema/inference/InferSchemaAccessStrategy.java
##########
@@ -43,7 +43,7 @@ public InferSchemaAccessStrategy(final RecordSourceFactory<T> recordSourceFactor
     public RecordSchema getSchema(final Map<String, String> variables, final InputStream contentStream, final RecordSchema readSchema) throws IOException {
         // We expect to be able to mark/reset any length because we expect that the underlying stream here will be a ContentClaimInputStream, which is able to
         // re-read the content regardless of how much data is read.
-        contentStream.mark(10_000_000);
+        contentStream.mark(Integer.MAX_VALUE);
Review comment:
I think we want to keep this at a smaller value. The general contract of
`InputStream` says that when `mark` is called, the stream must remember at
least that many bytes but is free to remember more. As the comment above
explains, the general expectation is that the stream will be of type
ContentClaimInputStream. In that case, the value passed to `mark` is ignored
because it can always roll back to the beginning of the stream (by re-reading
the file under the hood).
But keeping the 10 MB limit (or even a 1 MB limit, which would probably be
okay) means that if the caller does wrap the InputStream in a
BufferedInputStream, we only have to remember at most that amount of data. If
it's changed to Integer.MAX_VALUE, that bound is effectively removed, which
can very easily lead to an OutOfMemoryError. That should be avoided at almost
any cost, since it can leave the entire JVM defunct.
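To illustrate the contract being discussed: the argument to `InputStream.mark` is a readlimit, i.e. the number of bytes the stream promises to retain so that a later `reset` can rewind. A wrapper like `BufferedInputStream` may grow its internal buffer up to that limit, which is why an unbounded value is risky. The sketch below (an illustrative example, not NiFi code; the `peek` helper is hypothetical) shows a bounded readlimit in use:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkLimitDemo {

    // Hypothetical helper: read the first n bytes of a stream, then rewind.
    // The bounded readlimit (1 MB here) caps how much a BufferedInputStream
    // may have to buffer in memory in order to honor the later reset().
    static byte[] peek(final InputStream in, final int n) throws IOException {
        in.mark(1_000_000);              // retain at most ~1 MB for reset()
        final byte[] head = in.readNBytes(n);
        in.reset();                      // rewind to the marked position
        return head;
    }

    public static void main(final String[] args) throws IOException {
        final byte[] data = "hello, world".getBytes();
        final InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));

        // Peek at the first 5 bytes without consuming them.
        System.out.println(new String(peek(in, 5)));       // hello
        // The full content is still readable from the start.
        System.out.println(new String(in.readAllBytes())); // hello, world
    }
}
```

With `mark(Integer.MAX_VALUE)`, a `BufferedInputStream` in the same position would be permitted to buffer the entire stream in heap memory before `reset`, which is exactly the OutOfMemoryError risk described above.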
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]