Re: [Mojolicious] Re: Feature request - Mojo::DOM, have a line number for each element?

Ekki Plicht Sun, 24 Jul 2016 14:01:35 -0700

On Friday, 22 July 2016 20:51:06 UTC+2, Dan Book wrote:
>
> You could try something like this...
>
> $dom->find('*')->map(sub { state $i = 0; $_->{_myapp_counter} = $i++ });
>
>
Nice idea, but it wouldn't work.
A document like
<h2>1</h2>
   <p>bar</p>
<h2>2</h2>
  <p>bar</p>


would result in the numbering
1: 1st h2
2: 2nd h2
3: 1st p
4: 2nd p

whereas the sequence in the source is actually 1 3 2 4. And that's all I want: 
the sequence in which the elements appear in the original HTML source.






> Alternatively you could go through $dom->find('*') in order and test 
> $_->tag or other methods to collect the tags in the order you want. This 
> would only work if your criteria are simple enough that you don't really 
> need the CSS selector to find them.
>


I thought about this, but it would make selectors pointless and would 
probably take much longer. I would have to loop through all elements, check 
each one against my selectors (maybe a dozen, all very simple), and then work 
on the matching element. I found this approach so inelegant that I went back 
to HTML::HTML5::Parser, which has a source_line() function. The penalty is 
having to deal with XML, XPath and so on. Mojo::DOM is so much easier to use 
in most other cases...
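
For reference, the loop-and-check approach described above could look roughly 
like the sketch below. It assumes, as the quoted suggestion does, that 
find('*') yields elements in the order they appear in the source. The 
@selectors list and the seq field are made-up names for illustration only, 
and matches() is used to test each element against a selector.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Sample document from this message
my $html = <<'HTML';
<h2>1</h2>
   <p>bar</p>
<h2>2</h2>
  <p>bar</p>
HTML

my $dom = Mojo::DOM->new($html);

# Hypothetical list of the (simple) selectors of interest
my @selectors = ('h2', 'p');

my $seq = 0;
my @hits;

# Walk every element once; the counter records the order in which
# find('*') returned it (assumed here to be source order).
for my $el ($dom->find('*')->each) {
    $seq++;
    for my $sel (@selectors) {
        next unless $el->matches($sel);
        push @hits, { seq => $seq, selector => $sel, tag => $el->tag };
        last;    # first matching selector wins
    }
}

# One combined sequence for all tags of interest
say "$_->{seq}: $_->{tag} (matched '$_->{selector}')" for @hits;
```

The cost is one matches() call per element and selector, which is the 
overhead complained about above.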

Regards,
Ekki




 

>
> On Fri, Jul 22, 2016 at 1:24 PM, Ekki Plicht <[email protected]> wrote:
>
>>
>> On Friday, 22 July 2016 16:28:07 UTC+2, Scott Wiersdorf wrote:
>>>
>>> You can use map() to do that:
>>>
>>> $dom->find('div')->map(sub { state $i = 0; say $i++ . " $_" });
>>>
>>
>> Right, that would give me the proper sequence for all <div>s. 
>> And then I would have another sequence for all <h1>s, and another for all 
>> <td>s and another for all <p>s, and so on.
>>
>> What I need is one sequence which gives me the right order of all tags I 
>> am looking at.
>>
>> Cheers,
>> Ekki
>>
>>
>>
>>
>>
>>
>>  
>>
>>>
>>>
>>> Scott
>>>
>>> On Friday, July 22, 2016 at 1:44:45 AM UTC-6, Ekki Plicht wrote:
>>>>
>>>> I use Mojo::DOM for various web scraping and analysis, very easy, very 
>>>> fast, nice.
>>>>
>>>> Usually I am interested in only a few tags, not the entire DOM. So I 
>>>> use ->find() to select the interesting nodes, check some facts on the 
>>>> found nodes and store the results in a database for later viewing.
>>>>
>>>> For this later viewing I would love to retain the sequence in which the 
>>>> nodes are in the source. Unfortunately all information about the sequence 
>>>> of tags is lost when I use ->find(). 
>>>>
>>>> The parser I used before (HTML::HTML5::Parser) does provide a 
>>>> line-number function for each element. This is enough for me to retain the 
>>>> sequence of nodes, the absolute position is not important.
>>>>
>>>> Do you think it would be possible to extend Mojo::DOM to provide a line 
>>>> number for each element? I understand that this might be insufficient for 
>>>> the situation where many tags are on the same line, but that's too bad 
>>>> then... 
>>>>
>>>> TIA,
>>>> Ekki
>>>>
>>>>
>>>>
>>>>
>>>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Mojolicious" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/mojolicious.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

