jnioche opened a new pull request, #2803:
URL: https://github.com/apache/tika/pull/2803

## Summary

- Adds a `maxPages` field to `PDFParserConfig` (default `-1`, no limit)
- `AbstractPDF2XHTML.processPages()` breaks out of the page loop early when the limit is reached, skipping all text extraction, font mapping, and content stream work for subsequent pages
- Setter validates that the value is `-1` or `>= 1`, throwing `IllegalArgumentException` otherwise
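
As a rough illustration of the behaviour listed above, here is a minimal, self-contained sketch. It is not the code from this pull request: `MaxPagesSketch`, its nested `Config` class, the `setMaxPages`/`getMaxPages` names, and the list-of-strings page model are all hypothetical stand-ins, used only to show the validation rule and the early exit from the page loop.

```java
import java.util.ArrayList;
import java.util.List;

public class MaxPagesSketch {

    // Hypothetical stand-in for the maxPages field on PDFParserConfig.
    public static final class Config {
        private int maxPages = -1; // -1 means no limit

        public int getMaxPages() {
            return maxPages;
        }

        public void setMaxPages(int maxPages) {
            // Accept only -1 (no limit) or a positive page count.
            if (maxPages != -1 && maxPages < 1) {
                throw new IllegalArgumentException(
                        "maxPages must be -1 (no limit) or >= 1, got " + maxPages);
            }
            this.maxPages = maxPages;
        }
    }

    // Hypothetical stand-in for the page loop: each string represents one page.
    public static List<String> processPages(List<String> pages, Config config) {
        List<String> extracted = new ArrayList<>();
        int processed = 0;
        for (String page : pages) {
            // Stop before doing any work on pages beyond the limit.
            if (config.getMaxPages() != -1 && processed >= config.getMaxPages()) {
                break;
            }
            extracted.add("text of " + page); // placeholder for real extraction
            processed++;
        }
        return extracted;
    }

    public static void main(String[] args) {
        Config config = new Config();
        config.setMaxPages(2);
        // Prints [text of p1, text of p2]
        System.out.println(processPages(List.of("p1", "p2", "p3"), config));
    }
}
```

In this sketch the limit check sits at the top of the loop body, so a skipped page costs nothing beyond the comparison. That mirrors the intent described above, where pages past the limit incur no extraction work.
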
 
## Performance

Benchmarked on a 738-page PDF:

- Full parse: ~1,800 ms
- First 5 pages: ~135 ms (13× faster)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.