Re: How can I extract cell data (content surrounded by <td></td>) from a <table> in HTML response?

Thu, 26 Nov 2009 06:08:30 -0800

Hello, I'm facing further problems using XPath.

I now use an HTML page that contains fewer CSS attributes but a lot of whitespace,
such as:

<tr class="line2">
                                
                                
                                        <td>
*******/acq_web_front/etu_rec/InitConsultationDDFAction.do?numeroFL=604384
604384 </td>
                                        <td>313609133
</td>
                                        <td>******
</td>
                                        <td>20
</td>
                                        <td>63
</td>
                                        <td>******
</td>
                                        <td>23/11/09
</td>
                                
                                
                                
                                 
http://old.nabble.com/file/p26529936/html_special_chars_view.GIF
html_special_chars_view.GIF 
                        </tr>

I would like to retrieve //table/tr, then iterate over each <tr> and retrieve
every <td>'s content.
The goal is to send one HTTP request per <tr>, so that the atomic content of
each <td> in the current <tr> becomes a parameter of the request. The exception
is the first <td>, which contains an <a> element. (It should create a parameter
from the <a>'s content, 604384.)

If I use a regex, I will capture every td's content into a group and
concatenate the groups into a string like "A=313609133&B=****". But once again
I failed to write the proper regex.
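
One way to avoid a single large regex is to match in two passes: first each <tr>, then each <td> inside that row. The sketch below is only an illustration in plain Java (java.util.regex, not JMeter's ORO engine). The class name RowQueryStrings, the method build and the parameter names A, B, C are assumptions made for the example, and it assumes there are no nested tables and that the values need no URL encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RowQueryStrings {
    // (?is): case-insensitive, and '.' also matches line breaks.
    // The reluctant .*? stops at the first closing tag instead of the last one.
    private static final Pattern ROW = Pattern.compile("(?is)<tr\\b[^>]*>(.*?)</tr>");
    private static final Pattern CELL = Pattern.compile("(?is)<td\\b[^>]*>(.*?)</td>");

    /** Returns one "A=...&B=..." string per <tr> found in the input. */
    public static List<String> build(String html) {
        List<String> queries = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            StringBuilder query = new StringBuilder();
            Matcher cell = CELL.matcher(row.group(1));
            char name = 'A'; // hypothetical parameter names: A, B, C, ...
            while (cell.find()) {
                // Drop nested tags such as <a ...> and trim the surrounding whitespace.
                String text = cell.group(1).replaceAll("<[^>]+>", "").trim();
                if (query.length() > 0) {
                    query.append('&');
                }
                query.append(name++).append('=').append(text);
            }
            queries.add(query.toString());
        }
        return queries;
    }

    public static void main(String[] args) {
        String html = "<tr class=\"line2\">\n  <td>\n<a href=\"x\">604384</a> </td>\n"
                + "  <td>313609133\n</td>\n  <td>20\n</td>\n</tr>";
        System.out.println(build(html));
    }
}
```

Because each row is matched on its own, the cells of one row can never leak into the string built for another row.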

With XPath, I don't know how to get this iteration done:
by writing //t...@class='line1' or @class='line2']/td, or, 
//t...@class='ligne1' or @class='ligne2']/td/text() 
 I will iterate over every td's atomic content but will be unable to capture
the <a> in the first <td>.
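
A possible way to keep cells tied to their row is to select the rows first and then evaluate a relative expression against each row node. The following is a minimal sketch in plain Java with javax.xml.xpath, not a JMeter configuration. The class name RowCells, the method extract and the sample markup are assumptions, and the input is assumed to be well-formed XHTML (real HTML would first need tidying, for example with JTidy). normalize-space(.) returns the text of a cell, including the text of a nested <a>, with the surrounding whitespace removed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class RowCells {
    /** Returns one list of trimmed cell texts per matching <tr>. */
    public static List<List<String>> extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Step 1: select the rows only.
            NodeList rows = (NodeList) xpath.evaluate(
                    "//table/tr[@class='line1' or @class='line2']",
                    doc, XPathConstants.NODESET);
            List<List<String>> result = new ArrayList<>();
            for (int r = 0; r < rows.getLength(); r++) {
                Node row = rows.item(r);
                // Step 2: "td" is evaluated relative to the current row node.
                NodeList cells = (NodeList) xpath.evaluate("td", row, XPathConstants.NODESET);
                List<String> values = new ArrayList<>();
                for (int c = 0; c < cells.getLength(); c++) {
                    values.add(xpath.evaluate("normalize-space(.)", cells.item(c)));
                }
                result.add(values);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<table>"
                + "<tr class=\"line2\"><td>\n <a href=\"x.do?numeroFL=604384\">604384</a> </td>"
                + "<td>313609133\n</td><td>20\n</td></tr>"
                + "</table>";
        System.out.println(extract(xhtml));
    }
}
```

Since every row yields its own list, a value such as a date can no longer end up among the cells of a neighbouring row.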


And when I read the Debug Sampler's output, I saw variables like these:

detailsEtudes_30=25/11/09 
detailsEtudes_31=313609133 
detailsEtudes_32=**** 
detailsEtudes_33=20 
detailsEtudes_34=63 
detailsEtudes_35=Sans suite 
detailsEtudes_36=23/11/09 

which means that I captured a <td> whose atomic content is 25/11/09, but it
was placed in the wrong row during iteration.


So what should I do with it?



Deepak Shetty wrote:
> 
> Is there any reason why you aren't using XPath?
> 
> Extractor1  = varCol8 = //table/tr[td[position()=1 and
> text()='313609133']]/td[8]
> Extractor2  = varCol9 = //table/tr[td[position()=1 and
> text()='313609133']]/td[9]
> 
> This assumes that if there is an 8th column, there will always be a 9th
> column. I'm not quite sure how to use one extractor to extract both columns,
> but you should be able to loop through the values with an explicit counter.
> 
> regards
> deepak
> 
> 
> On Fri, Nov 20, 2009 at 9:08 AM, Deepak Shetty <[email protected]> wrote:
> 
>> Hi
>> If you need JMeter to iterate over variables with a ForEach Controller, the
>> variable names must have a specific form.
>>
>> http://jakarta.apache.org/jmeter/usermanual/component_reference.html#ForEach_Controller
>> So if you had an array of strings
>> //pseudo
>> for (int i = 0; i < list.length; i++) {
>>     // the ForEach Controller reads varName_1, varName_2, ...
>>     vars.put("varName_" + (i + 1), list[i]);
>> }
>> I can't remember offhand whether you also need varName_n=count (the total
>> number); try it out.
>> Then you should be able to use a ForEach Controller with varName.
>>
>> Also, you say you have an ArrayList and are using
>>                vars.put("responseList", responseList);
>> That won't work; this method takes (String, String). If you need to store
>> objects, you have to use vars.putObject(key, object);
>>
>> While working with BSH, always check your jmeter.log for errors.
>>
>> regards
>> deepak
>>
>>
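
The naming convention can be sketched in plain Java as follows. This is only an illustration: a Map stands in for JMeter's vars object, and the class name ForEachVars, the method flatten and the prefix responseList are assumptions. As far as I know from the ForEach Controller documentation, the controller reads prefix_1, prefix_2 and so on, and stops at the first name that is not defined.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ForEachVars {
    /** Flattens a list into prefix_1 .. prefix_N entries. */
    public static Map<String, String> flatten(String prefix, List<String> values) {
        // Stand-in for JMeter's vars; in BeanShell this would be vars.put(name, value).
        Map<String, String> vars = new LinkedHashMap<>();
        for (int i = 0; i < values.size(); i++) {
            // Numbering starts at 1, not 0.
            vars.put(prefix + "_" + (i + 1), values.get(i));
        }
        return vars;
    }

    public static void main(String[] args) {
        Map<String, String> vars = flatten("responseList", Arrays.asList("20", "63"));
        System.out.println(vars); // prints {responseList_1=20, responseList_2=63}
    }
}
```

Every value is stored as a String, which is what vars.put expects, so no putObject call is needed for this pattern.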
>> On Fri, Nov 20, 2009 at 7:44 AM, rosiere <[email protected]> wrote:
>>
>>>
>>> Hello,
>>>
>>> Thanks for your explanation.
>>> In fact, the HTML layout that I am trying to parse is stable and unlikely
>>> to change in the future; that's why I want to parse it.
>>>
>>> Since I'm not good at regex, I will use JMeter just to get the HTML
>>> response from an HTTPS-based web site, and to store the parsing results in
>>> Java objects such as ArrayList.
>>>
>>> So I created some HTTP Request samplers, then attached a BeanShell
>>> PostProcessor to them.
>>> In the BeanShell script, I wrote some logic with the org.w3c.dom and JTidy
>>> APIs, and now I can see the extracted cell contents via
>>> System.err.println() in my BeanShell script.
>>>
>>> After that I had difficulties with JMeter variable usage. In my BeanShell
>>> script I created ArrayList objects, stored the extracted texts in them, and
>>> put them into the JMeter context:
>>>                vars.put("responseList", responseList);
>>>                vars.put("responseDateList", responseDateList);
>>> http://old.nabble.com/file/p26443545/BeanShellPostProcessor.gif
>>>
>>> After parsing my HTML response, I need a ForEach Controller to iterate over
>>> the elements of these List objects (which are just arrays of the values in
>>> the selected <td> elements), and to issue a JDBC request to store them in a
>>> database (or any other operation that sends them out of JMeter).
>>> http://old.nabble.com/file/p26443545/ForEachController.gif
>>>
>>> However, I was unable to get a ForEach Controller to operate on objects in
>>> vars.
>>>
>>> What did I miss, and what should I do to iterate over the contents of vars
>>> and run a sampler on each value in the iteration?
>>>
>>> With my best wishes,
>>>
>>> Rosière
>>>
>>>
>>> Deepak Shetty wrote:
>>> >
>>> > Hi
>>> > the regex you are using doesn't seem correct.
>>> > [^tr]
>>> > matches any single character that is neither 't' nor 'r'; it doesn't mean
>>> > "not the sequence tr".
>>> >
>>> > Plus, if you are getting multiple <tr> elements instead of the one you
>>> > expect, your regex is probably too greedy. Try replacing .* constructs
>>> > with .*? or modify the regex.
>>> >
>>> > In any case, XPath is as dependent on the HTML structure as a regex is
>>> > (e.g. what if you move to a tableless layout?)
>>> >
>>> >
>>> > regards
>>> > deepak
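
Both points can be checked with a few lines of plain Java. This is a minimal sketch using java.util.regex rather than JMeter's ORO engine, so it only illustrates the general behaviour of character classes and greedy versus reluctant quantifiers; the class name GreedyVsLazy and the method countMatches are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    /** Counts how many separate matches the pattern finds in the input. */
    public static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<tr><td>1</td></tr><tr><td>2</td></tr>";
        // Greedy .* runs to the last </tr>, so both rows land in one match.
        System.out.println(countMatches("(?is)<tr>(.*)</tr>", html));  // prints 1
        // Reluctant .*? stops at the first </tr>, giving one match per row.
        System.out.println(countMatches("(?is)<tr>(.*?)</tr>", html)); // prints 2
        // [^tr] is a single character that is neither 't' nor 'r'.
        System.out.println("t".matches("[^tr]")); // prints false
        System.out.println("x".matches("[^tr]")); // prints true
    }
}
```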
>>> >
>>> > On Thu, Nov 19, 2009 at 8:17 AM, rosiere <[email protected]> wrote:
>>> >
>>> >>
>>> >> Hello,
>>> >>
>>> >> Thanks for your advice.
>>> >>
>>> >> I did apply a case-insensitive check, like this:
>>> >>
>>> >> (?is)<tr\sclass="tgDataLine.*1\)\" >([^tr].*)</tr>
>>> >>
>>> >> However, I still face a problem. Now I capture all <tr> elements in a
>>> >> single group instead of one group per <tr> element.
>>> >>
>>> >> I read this information about the matching in my jmeter.log:
>>> >>
>>> >> 2009/11/19 17:03:33 DEBUG - jmeter.extractor.RegexExtractor: Regex = (?is)<tr\sclass="tgDataLine.*1\)\" >([^tr].*)</tr>
>>> >> 2009/11/19 17:03:33 DEBUG - jmeter.extractor.RegexExtractor: RegexExtractor: Match found!
>>> >> 2009/11/19 17:03:33 DEBUG - jmeter.extractor.RegexExtractor: RegexExtractor: Template piece #0 = 1
>>> >> 2009/11/19 17:03:33 DEBUG - jmeter.extractor.RegexExtractor: RegexExtractor: Template piece #1 =
>>> >> 2009/11/19 17:03:33 DEBUG - jmeter.extractor.RegexExtractor: Regex Extractor result =
>>> >> <TD>....<TD>
>>> >> <TR>...</TR>
>>> >> ...
>>> >> <TR>....</TR>
>>> >> <TD>
>>> >>
>>> >>
>>> >> As for alternatives, I did want to parse the HTML with the org.w3c.dom
>>> >> API, but DOM methods like getElementsByTagName() are all case sensitive
>>> >> and may not be able to parse an HTML page with both uppercase and
>>> >> lowercase tags.
>>> >>
>>> >> Besides, whenever the HTML page changes, I will have to rewrite my Java
>>> >> code based on the DOM API. So, in order to minimize these unwanted
>>> >> effects on my Java code, I would still like to use regex, so that
>>> >> whenever the HTML structure changes, I need only change the regex in
>>> >> JMeter and not the Java code that consumes the extracted HTML portions.
>>> >>
>>> >>
>>> >>
>>> >> Deepak Shetty wrote:
>>> >> >
>>> >> > You should probably make the check case insensitive. But I agree with
>>> >> > sebb: parsing HTML constructs with regex is a pain and breaks quite
>>> >> > frequently.
>>> >> > regards
>>> >> > deepak
>>> >> >
>>> >> > On Wed, Nov 18, 2009 at 10:37 AM, Andre Arnold <[email protected]> wrote:
>>> >> >
>>> >> >> sebb schrieb:
>>> >> >> > On 18/11/2009, rosiere <[email protected]> wrote:
>>> >> >> >
>>> >> >> >>  Hello,
>>> >> >> >>
>>> >> >> >>  I found that JMeter's ORO regex is somewhat different from Java's.
>>> >> >> >>
>>> >> >> >
>>> >> >> > Yes.
>>> >> >> >
>>> >> >> > But not all that different; and neither is particularly well suited
>>> >> >> > to this task.
>>> >> >> >
>>> >> >> > The XPath Extractor will probably be much easier to use.
>>> >> >> >
>>> >> >> >
>>> >> >> > http://jakarta.apache.org/jmeter/usermanual/component_reference.html#XPath_Extractor
>>> >> >> >
>>> >> >> > This was discussed on the mailing list earlier this year.
>>> >> >> >
>>> >> >> >
>>> >> >> >>  Now I need to iterate over the different <tr> elements that match
>>> >> >> >>  a pattern, then capture all the <td> elements within each <tr>,
>>> >> >> >>  and select the 8th and 9th <td>.
>>> >> >> >>
>>> >> >> >>  Since many <tr> elements appear in the HTML response, in order to
>>> >> >> >>  do this I have to capture each <tr> line by line, without
>>> >> >> >>  including two lines in the same group:
>>> >> >> >>
>>> >> >> >>  so I should avoid capturing consecutive <tr>..</tr><tr>..</tr> in
>>> >> >> >>  the same group.
>>> >> >> >>
>>> >> >> >>  By writing (?is)<tr\sclass="tgDataLine.*1\)\" >(.*)</tr> I capture
>>> >> >> >>  only one group, which contains many real <tr> elements.
>>> >> >> >>  So what should I write in the regex?
>>> >> >> >>
>>> >> >> >>
>>> >> >> In case you still need a pattern that matches your needs:
>>> >> >> I found that the following matches the number you wanted and the
>>> >> >> value of the following column.
>>> >> >>
>>> >> >> reference: ref
>>> >> >> pattern:     (?s)<TR.+?<TD.+?>([1-9|0]+?)</TD.+?<TD.+?>(.+?)</TD>
>>> >> >> template:  $1$$2$
>>> >> >> match :     1
>>> >> >>
>>> >> >> In ref_g1 you'll find the number.
>>> >> >> In ref_g2 you'll find the following column value.
>>> >> >>
>>> >> >> To catch all the matches, you need to increment a counter for the
>>> >> >> match number and check whether there is another match or not.
>>> >> >>
>>> >> >> Your Test Plan should look something like this:
>>> >> >>
>>> >> >> -while controller (${__javaScript("${ref}"!="error")}  )
>>> >> >> --counter (from 1 with increment 1 for the regex match value)
>>> >> >> --Http Sampler (to get your site)
>>> >> >> ---RegEx Extractor (as shown above)
>>> >> >> --if controller (same as while controller --> "${ref}"!="error")
>>> >> >> ---your jdbc action (use ref_g1 & ref_g2)
>>> >> >>
>>> >> >>
>>> >> >> Hope I got your problem right.
>>> >> >>
>>> >> >>
>>> ---------------------------------------------------------------------
>>> >> >> To unsubscribe, e-mail: [email protected]
>>> >> >> For additional commands, e-mail:
>>> [email protected]
>>> >> >>
>>> >> >>
>>> >> >
>>> >> >
>>> >>
>>> >> --
>>> >> View this message in context:
>>> >>
>>> http://old.nabble.com/How-can-I-extract-cell-data-%28content-surrounded-by-%3Ctd%3E%3C-td%3E%29-from-a-%3Ctable%3E-in-HTML-response--tp26371440p26421379.html
>>> >> Sent from the JMeter - User mailing list archive at Nabble.com.
>>> >>
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>>
>>> --
>>> View this message in context:
>>> http://old.nabble.com/How-can-I-extract-cell-data-%28content-surrounded-by-%3Ctd%3E%3C-td%3E%29-from-a-%3Ctable%3E-in-HTML-response--tp26371440p26443545.html
>>> Sent from the JMeter - User mailing list archive at Nabble.com.
>>>
>>>
>>>
>>>
>>
> 
> 

-- 
View this message in context: 
http://old.nabble.com/How-can-I-extract-cell-data-%28content-surrounded-by-%3Ctd%3E%3C-td%3E%29-from-a-%3Ctable%3E-in-HTML-response--tp26371440p26529936.html
Sent from the JMeter - User mailing list archive at Nabble.com.

