[ 
https://issues.apache.org/jira/browse/TIKA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659320#comment-16659320
 ] 

Markus Jelsma commented on TIKA-2759:
-------------------------------------

Thanks [~talli...@apache.org]!

> ScriptsExtractor incorrectly reports Javascript to characters() in SAX 
> ContentHandler
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-2759
>                 URL: https://issues.apache.org/jira/browse/TIKA-2759
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.18
>            Reporter: Markus Jelsma
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 2.0.0, 1.20
>
>         Attachments: petrolicious.html
>
>
> JavaScript is being extracted as text content, although it actually comes 
> from a script tag whose src attribute is an inline base64 data: URI. The 
> inline code is decoded and reported to the characters() method of our custom 
> ContentHandler, so it ends up in the extracted text, but the script start tag 
> itself never seems to be reported to startElement(). The JavaScript is 
> reported to characters() after we have left the head and entered the body.
> The HTML file is attached.
> The following script tag:
> {code}
>   <script 
> src="data:text/javascript;base64,Oyh3aW5kb3cuanExODN8fGpRdWVyeSkoZnVuY3Rpb24oJCl7bmV3IEltcHJvdmVkQUpBWExvZ2luKHsNCmlkOiAxNTcsDQppc0d1ZXN0OiAxLA0Kb2F1dGg6IHsiZmFjZWJvb2siOiJodHRwczpcL1wvd3d3LmZhY2Vib29rLmNvbVwvZGlhbG9nXC9vYXV0aD9zY29wZT1lbWFpbCZyZXNwb25zZV90eXBlPWNvZGUmZGlzcGxheT1wb3B1cCZjbGllbnRfaWQ9MTcyODk0MjQzMDY1MDQ4NiZyZWRpcmVjdF91cmk9aHR0cCUzQSUyRiUyRnBldHJvbGljaW91cy5jb20lMkZpbmRleC5waHAlM0ZvcHRpb24lM0Rjb21faW1wcm92ZWRfYWpheF9sb2dpbiUyNnRhc2slM0RmYWNlYm9vayIsImdvb2dsZSI6Imh0dHBzOlwvXC9hY2NvdW50cy5nb29nbGUuY29tXC9vXC9vYXV0aDJcL2F1dGg/c2NvcGU9aHR0cHMlM0ElMkYlMkZ3d3cuZ29vZ2xlYXBpcy5jb20lMkZhdXRoJTJGdXNlcmluZm8uZW1haWwraHR0cHMlM0ElMkYlMkZ3d3cuZ29vZ2xlYXBpcy5jb20lMkZhdXRoJTJGdXNlcmluZm8ucHJvZmlsZSZyZXNwb25zZV90eXBlPWNvZGUmZGlzcGxheT1wb3B1cCZjbGllbnRfaWQ9ODQ5NDk3NjQ3ODUzLW1mOThqNGdlOGkwYzlkaTFrbG9zc2YxbmdibWI2cG12LmFwcHMuZ29vZ2xldXNlcmNvbnRlbnQuY29tJnJlZGlyZWN0X3VyaT1odHRwJTNBJTJGJTJGcGV0cm9saWNpb3VzLmNvbSUyRmluZGV4LnBocCUzRm9wdGlvbiUzRGNvbV9pbXByb3ZlZF9hamF4X2xvZ2luJTI2dGFzayUzRGdvb2dsZSJ9LA0KYmdPcGFjaXR5OiAwLjQsDQpyZXR1cm5Vcmw6ICcvaXMtdGhpcy1kdXRjaC1jbGFzc2ljLWZpbmFsbHktYXMtY29vbC1hcy1hLWJtdycsDQpib3JkZXI6IHBhcnNlSW50KCdmNWY1ZjV8KnwzfCp8YzRjNGM0fCp8Nycuc3BsaXQoJ3wqfCcpWzFdKSwNCnBhZGRpbmc6IDQsDQp1c2VBSkFYOiAwLA0Kb3BlbkV2ZW50OiAnb25jbGljaycsDQp3bmRDZW50ZXI6IDAsDQpyZWdQb3B1cDogMSwNCmR1cjogMzAwLA0KdGltZW91dDogMCwNCmJhc2U6ICcvJywNCnRoZW1lOiAncGV0cm9saWNpb3VzJywNCnNvY2lhbFByb2ZpbGU6ICcnLA0Kc29jaWFsVHlwZTogJ2J0bkljbycsDQpjc3NQYXRoOiAnL21vZHVsZXMvbW9kX2ltcHJvdmVkX2FqYXhfbG9naW4vY2FjaGUvMTU3LzNkNDE4Mzk2NDk2N2Y2ZWVlYjI5MTdhOTI2OGM2MTIxLmNzcycsDQpyZWdQYWdlOiAnam9vbWxhJywNCmNhcHRjaGE6ICcnLA0Kc2hvd0hpbnQ6IDAsDQpnZW9sb2NhdGlvbjogZmFsc2UsDQp3aW5kb3dBbmltOiAnJw0KfSl9KTs="
>  type="text/javascript"></script>
> {code}
> gets reported outside the head (in html.p) as:
> {code}
> ;(window.jq183||jQuery)(function($){new ImprovedAJAXLogin({
> id: 157,
> isGuest: 1,
> oauth: 
> {"facebook":"https:\/\/www.facebook.com\/dialog\/oauth?scope=email&response_type=code&display=popup&client_id=1728942430650486&redirect_uri=http%3A%2F%2Fpetrolicious.com%2Findex.php%3Foption%3Dcom_improved_ajax_login%26task%3Dfacebook","google":"https:\/\/accounts.google.com\/o\/oauth2\/auth?scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.profile&response_type=code&display=popup&client_id=849497647853-mf98j4ge8i0c9di1klossf1ngbmb6pmv.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Fpetrolicious.com%2Findex.php%3Foption%3Dcom_improved_ajax_login%26task%3Dgoogle"},
> bgOpacity: 0.4,
> returnUrl: '/is-this-dutch-classic-finally-as-cool-as-a-bmw',
> border: parseInt('f5f5f5|*|3|*|c4c4c4|*|7'.split('|*|')[1]),
> padding: 4,
> useAJAX: 0,
> openEvent: 'onclick',
> wndCenter: 0,
> regPopup: 1,
> dur: 300,
> timeout: 0,
> base: '/',
> theme: 'petrolicious',
> socialProfile: '',
> socialType: 'btnIco',
> cssPath: 
> '/modules/mod_improved_ajax_login/cache/157/3d4183964967f6eeeb2917a9268c6121.css',
> regPage: 'joomla',
> captcha: '',
> showHint: 0,
> geolocation: false,
> windowAnim: ''
> })});
> {code}
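
For anyone who wants to check the decoded payload quoted above, here is a minimal sketch that decodes a base64 data: URI using only the JDK. It is not Tika's code and makes no claim about how ScriptsExtractor works internally; the class name DataUriScript and the method decodeBase64DataUri are hypothetical names chosen for illustration, and UTF-8 is assumed for the payload. The sample value is only the first 32 base64 characters of the src attribute from the issue.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration only; not part of Tika.
final class DataUriScript {

    /**
     * Returns the decoded payload of a base64 data: URI,
     * or null if src is not a base64 data: URI.
     */
    static String decodeBase64DataUri(String src) {
        if (src == null || !src.startsWith("data:")) {
            return null;
        }
        int comma = src.indexOf(',');
        if (comma < 0) {
            return null;
        }
        // Header looks like "text/javascript;base64"
        String header = src.substring("data:".length(), comma);
        if (!header.endsWith(";base64")) {
            return null;
        }
        byte[] bytes = Base64.getDecoder().decode(src.substring(comma + 1));
        // Assumption: the payload is UTF-8 text.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First 32 base64 characters of the src attribute in the issue.
        String src = "data:text/javascript;base64,Oyh3aW5kb3cuanExODN8fGpRdWVyeSko";
        System.out.println(decodeBase64DataUri(src));
    }
}
```

Run against the full src value from the issue, this should produce the same JavaScript that currently shows up in characters(), which confirms the text originates from the script tag's src attribute rather than from the page body.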



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
