[ 
https://issues.apache.org/jira/browse/SOLR-12320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465465#comment-16465465
 ] 

Uwe Schindler commented on SOLR-12320:
--------------------------------------

Hi Mark,
thanks for bringing this up! I have thought about this multiple times, 
especially about cleaning up the temporary files! Because of our discussion 
around ContentStreams (sorry for this, I was overreacting) I just wanted to 
share my idea about this:

- You are right, small files should probably be kept in memory as byte[] blobs 
after parsing the multipart request. This is one option (see below). I think 
commons-fileupload can be configured to do this, but we have to check (I last 
looked at this 3 years ago).
- Large files may create temp files. The problem is indeed the cleanup. My idea 
after the discussion yesterday: the ContentStreams of a SolrRequest are all 
read-once, so every ContentStream pointing to a file from a multipart request 
should have a customized "close()" method that deletes the file after calling 
"super.close()". This would make the cleanup easier. For the remaining files 
that may not have been read, the SolrRequestParsers class should delete them 
when the request ends. Maybe we can add a hook in SolrDispatchFilter so it 
calls a "cleanup()" method in SolrRequestParsers.
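
A minimal sketch of the delete-on-close idea, using only the JDK. The names 
DeleteOnCloseInputStream and UploadCleanup are hypothetical and are not 
existing Solr or commons-fileupload classes. How this would be wired into 
ContentStream, SolrRequestParsers and SolrDispatchFilter is left open here.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stream over an uploaded temp file; deletes the file on close. */
final class DeleteOnCloseInputStream extends FilterInputStream {
  private final Path tempFile;

  DeleteOnCloseInputStream(Path tempFile) throws IOException {
    super(Files.newInputStream(tempFile));
    this.tempFile = tempFile;
  }

  @Override
  public void close() throws IOException {
    try {
      super.close();                    // release the file handle first
    } finally {
      Files.deleteIfExists(tempFile);   // then remove the temp file
    }
  }
}

/** Hypothetical per-request registry for temp files that were never read. */
final class UploadCleanup {
  private final List<Path> tempFiles = new ArrayList<>();

  void register(Path tempFile) {
    tempFiles.add(tempFile);
  }

  /** Intended to be called once at the end of the request. */
  void cleanup() throws IOException {
    for (Path p : tempFiles) {
      Files.deleteIfExists(p);          // no-op if close() already deleted it
    }
    tempFiles.clear();
  }
}
```

Because deleteIfExists tolerates a missing file, it does not matter whether a 
stream was consumed and closed or never opened: cleanup() is safe in both 
cases.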

Another approach, working completely without temp files, might be:

- Change the ContentStreams from List to just Iterable or, better, Iterator. 
They could then only be consumed "in order" (and, with Iterator, only once).
- While consuming this iterator, the multipart request parts are extracted from 
the request on the fly. The ContentStream implementation just returns an 
InputStream wrapper on top of the underlying ServletInputStream (similar to 
ZipInputStream for reading zip files). I do not know whether this is doable 
with commons-fileupload, but I hope so.
- The underlying ServletInputStream is never closed. The wrapper streams may be 
closed, which is a no-op on the underlying stream.
- Not sure how contrib-extraction handles it, but TIKA already creates another 
set of temp files if you pass a stream to it and it requires random access.
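
The "close is a no-op on the underlying stream" part can be sketched with the 
JDK alone. NonClosingInputStream is a hypothetical name, and the sketch leaves 
out the multipart boundary handling that would decide where one part ends and 
the next begins.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical per-part wrapper: closing it leaves the shared stream open. */
final class NonClosingInputStream extends FilterInputStream {
  private boolean closed;

  NonClosingInputStream(InputStream shared) {
    super(shared);
  }

  private void ensureOpen() throws IOException {
    if (closed) {
      throw new IOException("part stream already closed");
    }
  }

  @Override
  public int read() throws IOException {
    ensureOpen();
    return super.read();
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    ensureOpen();
    return super.read(b, off, len);
  }

  @Override
  public void close() {
    // Mark only this wrapper as closed; the shared stream stays usable
    // for the parts that follow.
    closed = true;
  }
}
```

After a part's wrapper is closed, the next part can still be read from the same 
underlying stream, which is what reading the parts in order requires.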

I would prefer the second variant, but this one may require larger changes.

> Not all multi-part post requests should create tmp files.
> ---------------------------------------------------------
>
>                 Key: SOLR-12320
>                 URL: https://issues.apache.org/jira/browse/SOLR-12320
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>            Priority: Minor
>
> We create tmp files for multi-part posts because often they are uploaded 
> files for Solr cell or something but we also sometimes write params only or 
> params and updates as multi-part post. These should not create any tmp files.



