[jira] [Commented] (TIKA-1401) occured infinite loop using tika library
[ https://issues.apache.org/jira/browse/TIKA-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530284#comment-14530284 ]

Matthias Krueger commented on TIKA-1401:

I'm preparing a patch for this issue. I think we should avoid all DTD parsing in the XML root extractor. I need some input on how important Android code compatibility is. There are two options:

1. Continue to use SAX and set the Xerces feature https://xerces.apache.org/xerces2-j/features.html#disallow-doctype-decl
* This will prevent any XML containing a DTD from being parsed.
* Advantages: This will solve all DTD-related security issues. Code will still compile for Android.
* Disadvantage: Root element extraction will fail even for XML that is fine but just happens to contain a DTD.

2. Re-implement using StAX and javax.xml.stream.supportDTD=false
* This will cause the DTD part of the XML to be skipped (with no exception thrown).
* Advantages: This will solve all DTD-related security issues. Root elements can still be extracted for XML that contains a DTD (provided no entity is used in any of the root element's attribute values). Code is slightly more elegant as it avoids the ignorable exception.
* Disadvantage: StAX is not supported on Android.

occured infinite loop using tika library

Key: TIKA-1401
URL: https://issues.apache.org/jira/browse/TIKA-1401
Project: Tika
Issue Type: Bug
Components: detector
Affects Versions: 1.5
Reporter: Robin.Hwang

Hi

1. Save the file with the following content as errorfile.xml
{code}
<?xml version="1.0"?>
<!DOCTYPE billion [
<!ELEMENT billion (#PCDATA)>
<!ENTITY laugh0
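The two options in Matthias's comment could look roughly like the following. This is only a sketch, not the Tika patch: the class and method names (RootElementSketch, rootViaSax, rootViaStax) are hypothetical, and it assumes the JDK's bundled parsers, which recognise the Xerces feature URI and the standard StAX property. Other parser implementations may behave differently.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
final class RootElementSketch {

    // Option 1: SAX with the Xerces disallow-doctype-decl feature.
    // Returns null if parsing fails, which includes any document with a DOCTYPE.
    static String rootViaSax(InputStream in) {
        final String[] root = new String[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl", true);
            factory.newSAXParser().parse(in, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                        String qName, Attributes attributes) throws SAXException {
                    root[0] = localName;
                    // Abort after the first element; this is the
                    // "ignorable exception" of the SAX approach.
                    throw new SAXException("root element found");
                }
            });
        } catch (Exception e) {
            // Either the deliberate abort above or a real parse failure
            // (for example a rejected DOCTYPE); root[0] tells them apart.
        }
        return root[0];
    }

    // Option 2: StAX with javax.xml.stream.supportDTD=false.
    // The DTD is skipped rather than rejected, so the root can still be read.
    static String rootViaStax(InputStream in) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            factory.setProperty(
                    XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (Exception e) {
            // Malformed input: fall through and report no root element.
        }
        return null;
    }
}
```

With the JDK's built-in parsers, a document that starts with a DOCTYPE should make rootViaSax return null, while rootViaStax should still return the root element name. That matches the trade-off described in the two options.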
[jira] [Commented] (TIKA-1401) occured infinite loop using tika library
[ https://issues.apache.org/jira/browse/TIKA-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372172#comment-14372172 ]

Tyler Palsulich commented on TIKA-1401:

This still loops infinitely with Tika 1.8-SNAPSHOT.
[jira] [Commented] (TIKA-1401) occured infinite loop using tika library
[ https://issues.apache.org/jira/browse/TIKA-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110536#comment-14110536 ]

Robin.Hwang commented on TIKA-1401:

Thanks for the answer. Please let me know once this is confirmed.

I am considering an alternative approach that runs detection in a separate thread with a timeout. I would like to hear your feedback. The code is attached below.

{code}
import java.io.InputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.tika.Tika;
import org.apache.tika.mime.MimeTypes;

public class TikaFileTypeDetector implements Callable<String> {

    // Maximum time to wait for detection, in seconds.
    private static final int TIMEOUT = 10;

    private final Tika tika = new Tika();
    private InputStream inputStream;

    private TikaFileTypeDetector(InputStream inputStream) {
        this.inputStream = inputStream;
    }

    // Runs on the worker thread.
    @Override
    public String call() throws Exception {
        return tika.detect(inputStream);
    }

    // Returns the detected type, or application/octet-stream if detection
    // fails or does not finish within TIMEOUT seconds.
    public static String detect(InputStream inputStream) {
        if (inputStream == null) {
            return null;
        }
        String mimetype = MimeTypes.OCTET_STREAM;
        TikaFileTypeDetector detector = new TikaFileTypeDetector(inputStream);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(detector);
        try {
            mimetype = future.get(TIMEOUT, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            e.printStackTrace();
        } catch (ExecutionException e) {
            e.printStackTrace();
        } catch (TimeoutException e) {
            e.printStackTrace();
        } finally {
            // Only interrupts the worker; a thread stuck in a loop that never
            // checks for interruption will keep running.
            executor.shutdownNow();
        }
        return mimetype;
    }
}
{code}
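One limitation of a timeout wrapper like the one above is that the worker thread may keep spinning after the caller has given up. A possible variation, sketched here under assumptions, runs the task on a daemon thread so that a stuck worker at least does not keep the JVM alive. The names TimeoutSketch and callWithTimeout are hypothetical, and this does not stop the runaway parsing itself.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper class, for illustration only.
final class TimeoutSketch {

    // Runs the task on a daemon thread and returns the fallback value if the
    // task fails or does not finish within the given timeout.
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit,
            T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "detect-with-timeout");
            thread.setDaemon(true); // a stuck worker will not block JVM exit
            return thread;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker if it is interruptible
            return fallback;
        } catch (ExecutionException e) {
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With Tika on the classpath, a call might look like TimeoutSketch.callWithTimeout(() -> new Tika().detect(inputStream), 10, TimeUnit.SECONDS, MimeTypes.OCTET_STREAM). A worker caught in a loop that never checks for interruption still consumes CPU until the process exits, so this only contains the problem.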
[jira] [Commented] (TIKA-1401) occured infinite loop using tika library
[ https://issues.apache.org/jira/browse/TIKA-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110294#comment-14110294 ]

Nick Burch commented on TIKA-1401:

At first glance, it looks like we might need to bring over the naughty XML protection we have on the XML parsing side to the XML detector as well. Hopefully one of our XML experts can be along shortly to confirm!