Beginners Digest, Vol 15, Issue 21

beginners-request Wed, 30 Sep 2009 16:57:10 -0700

Send Beginners mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        http://www.haskell.org/mailman/listinfo/beginners
or, via email, send a message with subject or body 'help' to
        [email protected]


You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Beginners digest..."


Today's Topics:

   1. Re: remove XML tags using Text.Regex.Posix (Lyndon Maydwell)
   2. Re: remove XML tags using Text.Regex.Posix (aditya siram)
   3. Re: remove XML tags using Text.Regex.Posix (Magnus Therning)
   4. Re: remove XML tags using Text.Regex.Posix (Christian Maeder)
   5. Re: remove XML tags using Text.Regex.Posix (Jan Jakubuv)
   6. Re: remove XML tags using Text.Regex.Posix (Tom Tobin)
   7. subtle dissimilarity between fromInteger and fromIntegral
      (Hong Yang)
   8. Re: subtle dissimilarity between fromInteger and
      fromIntegral (Joe Fredette)


----------------------------------------------------------------------

Message: 1
Date: Wed, 30 Sep 2009 14:27:08 +0800
From: Lyndon Maydwell <[email protected]>
Subject: Re: [Haskell-beginners] remove XML tags using
        Text.Regex.Posix
To: Magnus Therning <[email protected]>
Cc: [email protected]
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset=UTF-8

HXT should be able to do what you're after quite easily from what I've seen.

On Wed, Sep 30, 2009 at 1:58 PM, Magnus Therning <[email protected]> wrote:
> On Tue, Sep 29, 2009 at 12:25:07PM -0700, Robert Ziemba wrote:
>> I have been working with the regular expression package (Text.Regex.Posix).
>> My hope was to find a simple way to remove a pair of XML tags from a short
>> string.
>>
>> I have something like this "<tag>Data</tag>" and would like to extract
>> 'Data'. There is only one tag pair, no nesting, and I know exactly what the
>> tag is.
>>
>> My first attempt was this:
>>
>>   "<tag>123</tag>" =~ "[^<tag>].+[^</tag>]"::String
>>
>> result: "123"
>>
>> Upon further experimenting I realized that it only works with more than 2
>> digits in 'Data'. It occurred to me that my thinking on how this regular
>> expression works was not correct - but I don't understand why it works at
>> all for 3 or more digits.
>>
>> Can anyone help me understand this result and perhaps suggest another
>> strategy? Thank you.
>
> Personally I would have used tagsoup for this sort of thing. Keep in mind
> the eternal words
>
>   Some people, when confronted with a problem, think 'I know, I'll use
>   regular expressions.' Now they have two problems.
>       -- Jamie Zawinski
>
> As you so nicely demonstrated yourself ;-)
>
> /M
>
> --
> Magnus Therning                        (OpenPGP: 0xAB4DFBA4)
> magnus＠therning．org          Jabber: magnus＠therning．org
> http://therning.org/magnus         identi.ca|twitter: magthe
>
> _______________________________________________
> Beginners mailing list
> [email protected]
> http://www.haskell.org/mailman/listinfo/beginners
>
>


------------------------------

Message: 2
Date: Wed, 30 Sep 2009 02:06:05 -0500
From: aditya siram <[email protected]>
Subject: Re: [Haskell-beginners] remove XML tags using
        Text.Regex.Posix
To: Robert Ziemba <[email protected]>
Cc: [email protected]
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

This is how I did it using the HXT library :

Prelude Text.XML.HXT.Parser.XmlParsec Text.XML.HXT.Arrow.XmlIOStateArrow
Text.XML.HXT.Arrow> runX (readString [] "<tag>123</tag>" >>> getXPathTrees
"tag" >>> getChildren >>> getText)
["123"]

Everything after "Prelude" up to the first ">" is what you have to import to
make this work.
-"readString" converts the input string into an internal representation of
an XML tree,
-"getXPathTrees" sets the path to all <tag>'s,
-"getChildren" narrows it down to the data between <tag> and </tag>,
-"getText" extracts all the data between those tags,
-"runX" fires up the whole process and returns the results as a list in the
IO monad.

hth,
deech


------------------------------

Message: 3
Date: Wed, 30 Sep 2009 08:59:35 +0100
From: Magnus Therning <[email protected]>
Subject: Re: [Haskell-beginners] remove XML tags using
        Text.Regex.Posix
To: [email protected]
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset=UTF-8

On Wed, Sep 30, 2009 at 6:58 AM, Magnus Therning <[email protected]> wrote:
[..]
> Personally I would have used tagsoup for this sort of thing. Keep in mind
> the eternal words
>
>   Some people, when confronted with a problem, think 'I know, I'll use
>   regular expressions.' Now they have two problems.
>       -- Jamie Zawinski
>
> As you so nicely demonstrated yourself ;-)

Here's a quick and dirty solution using tagsoup:

% cat file.xml
<tag>123</tag>
<tag>456</tag>
<tag>789</tag>

Text.HTML.Download Text.HTML.TagSoup> tags <- openItem "file.xml"
Text.HTML.Download Text.HTML.TagSoup> map (fromTagText . head . tail)
$ partitions (TagOpen "tag" [] ~==) (parseTags tags)
["123","456","789"]

/M

-- 
Magnus Therning                        (OpenPGP: 0xAB4DFBA4)
magnus＠therning．org          Jabber: magnus＠therning．org
http://therning.org/magnus         identi.ca|twitter: magthe


------------------------------

Message: 4
Date: Wed, 30 Sep 2009 14:48:58 +0200
From: Christian Maeder <[email protected]>
Subject: [Haskell-beginners] Re: remove XML tags using
        Text.Regex.Posix
To: Robert Ziemba <[email protected]>
Cc: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1

I think regexes are a pain and would suggest the xml-light package
(published on Hackage as "xml") for your purpose, which is the smallest
XML library. (Or use take, drop, isPrefixOf and isSuffixOf to chop off
your tags manually.)

http://hackage.haskell.org/package/xml
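
The manual approach mentioned above (take, drop, isPrefixOf, isSuffixOf)
could look roughly like the sketch below. It uses only the base library.
The name "stripTag" is made up for this illustration, and the sketch
assumes the tag pair wraps the entire string, as in the original question.

```haskell
import Data.List (isPrefixOf, isSuffixOf)

-- Hypothetical helper (not from any library): remove one known tag pair
-- that wraps the whole string, or return Nothing if the tags are absent.
stripTag :: String -> String -> Maybe String
stripTag name s
  | open `isPrefixOf` s && close `isSuffixOf` s && innerLen >= 0
      = Just (take innerLen (drop (length open) s))
  | otherwise = Nothing
  where
    open     = "<" ++ name ++ ">"      -- e.g. "<tag>"
    close    = "</" ++ name ++ ">"     -- e.g. "</tag>"
    -- Number of characters left once both tags are removed.
    innerLen = length s - length open - length close
```

Loaded into GHCi, `stripTag "tag" "<tag>123</tag>"` gives `Just "123"`,
and a string without the tags gives `Nothing` instead of a wrong answer.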

Cheers Christian

Prelude Text.XML.Light> concatMap strContent . onlyElems $ parseXML
 "<tag>123</tag>"
"123"





------------------------------

Message: 5
Date: Wed, 30 Sep 2009 17:11:46 +0100
From: Jan Jakubuv <[email protected]>
Subject: Re: [Haskell-beginners] remove XML tags using
        Text.Regex.Posix
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=utf-8

Hi Robert,

On Tue, Sep 29, 2009 at 12:25:07PM -0700, Robert Ziemba wrote:
> I have been working with the regular expression package (Text.Regex.Posix).
>  My hope was to find a simple way to remove a pair of XML tags from a short
> string.
> 
> I have something like this "<tag>Data</tag>" and would like to extract
> 'Data'.  There is only one tag pair, no nesting, and I know exactly what the
> tag is.
> 

This is so simple that I would not recommend anything other than regular
expressions. Use the following pattern:

    pat = "<tag>(.*)</tag>"

It creates a group within the matched string containing the data (it is
done using parentheses). Use `[[String]]` as a result type and you receive a
list of matches where each match is described by a list of strings whose
first member is the whole matched string (including <tag> and </tag>) and it
is followed by values of groups (in our case we have just one group). Thus:

    *Main> "text<tag>data</tag>text" =~ pat :: [[String]]
    [["<tag>data</tag>","data"]]

It is easy to extract the data using `(!!)` and `head`:

    *Main> (!! 1) . head $ ("text<tag>7</tag>text" =~ pat :: [[String]]) 
    "7"

> My first attempt was this:
> 
>   "<tag>123</tag>" =~ "[^<tag>].+[^</tag>]"::String
> 
> result:  "123"
> 

The problem with your pattern is that `[^<tag>]` doesn't mean what you think
it does. Its meaning is “one character which is not `<`, `t`, `a`, `g`, or
`>`”, as Patrick already described in his mail.

> Upon further experimenting I realized that it only works with more than 2
> digits in 'Data'.  It occurred to me that my thinking on how this regular
> expression works was not correct - but I don't understand why it works at
> all for 3 or more digits.
> 

It doesn't work for every input of 3 or more characters:
        
    *Main> "<tag>tag</tag>" =~ "[^<tag>].+[^</tag>]" :: String
    ""

Briefly, it doesn't work when the data contains one of the characters `<`,
`t`, `a`, `g`, `>`.
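
To see why three digits succeed and two do not, here is a small sketch
that emulates this one pattern by brute force, using only the base
library. It is not how Text.Regex.Posix works internally. The names
"firstClass", "lastClass", "fits" and "emulate" are invented for the
illustration, and it assumes that `.` matches any character and that
POSIX picks the leftmost, then longest, match.

```haskell
import Data.List (inits, tails)

-- The two bracket expressions are character sets, not literal tags.
firstClass, lastClass :: Char -> Bool
firstClass c = c `notElem` "<tag>"    -- [^<tag>]
lastClass  c = c `notElem` "</tag>"   -- [^</tag>]

-- Does a candidate substring fit "[^<tag>].+[^</tag>]"?
-- It needs at least three characters: one for each bracket
-- expression and at least one for ".+".
fits :: String -> Bool
fits w@(c : _ : _ : _) = firstClass c && lastClass (last w)
fits _                 = False

-- Brute-force emulation of the leftmost-longest rule: take the earliest
-- start position that has any fitting substring, then the longest
-- fitting substring from there ("" if nothing fits).
emulate :: String -> String
emulate s =
  case [ ws | t <- tails s, let ws = filter fits (inits t), not (null ws) ] of
    (ws : _) -> last ws   -- inits lists prefixes shortest first
    []       -> ""
```

With this, `emulate "<tag>123</tag>"` is `"123"`, while
`emulate "<tag>12</tag>"` and `emulate "<tag>tag</tag>"` are both `""`:
in the two-digit case no substring of at least three characters starts
and ends with an allowed character.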

Finally, consider using

    pat = "<tag>([^<]*)</tag>"

which works with more tags in the same line as well.

Sincerely,
    jan.







------------------------------

Message: 6
Date: Wed, 30 Sep 2009 12:30:31 -0500
From: Tom Tobin <[email protected]>
Subject: Re: [Haskell-beginners] remove XML tags using
        Text.Regex.Posix
To: [email protected]
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset=UTF-8

On Wed, Sep 30, 2009 at 11:11 AM, Jan Jakubuv <[email protected]> wrote:
> This is so simple that I would not recommend anything other than regular
> expressions. Use the following pattern:
>
>     pat = "<tag>(.*)</tag>"

Don't use this; the * operator is greedy by default, meaning that it will
match stuff like "<tag>foo</tag>bar<tag>baz</tag>", and your data will
end up being "foo</tag>bar<tag>baz".  In other words, a greedy
operator tries to consume as much of the string as it possibly can
while still matching.  If that regex module supports non-greedy
operators, you want something like this:

pat = "<tag>(.*?)</tag>"

A "?" after a greedy operator makes it non-greedy, meaning it will try
to match while consuming as little of the string as it can.  If the
posix regex module doesn't support this, the PCRE-based one should.
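
The difference can be shown without any regex library. The sketch below
uses only base and imitates the two behaviours by hand; "beforeClose",
"greedy" and "nonGreedy" are names invented here, and it assumes the
input starts with "<tag>" (a real regex would search anywhere in the
string). Note that POSIX extended regular expressions do not define a
non-greedy `*?` operator, so a pattern like "<tag>([^<]*)</tag>" is the
portable way to get the short match.

```haskell
import Data.List (inits, isPrefixOf, stripPrefix, tails)

-- Every prefix of s that is immediately followed by "</tag>",
-- shortest first.
beforeClose :: String -> [String]
beforeClose s =
  [ pre | (pre, rest) <- zip (inits s) (tails s), "</tag>" `isPrefixOf` rest ]

-- Imitates "<tag>(.*)</tag>": consume as much as possible.
greedy :: String -> Maybe String
greedy s = do
  body <- stripPrefix "<tag>" s
  case beforeClose body of
    [] -> Nothing
    ps -> Just (last ps)

-- Imitates "<tag>(.*?)</tag>": consume as little as possible.
nonGreedy :: String -> Maybe String
nonGreedy s = do
  body <- stripPrefix "<tag>" s
  case beforeClose body of
    []      -> Nothing
    (p : _) -> Just p
```

For "<tag>foo</tag>bar<tag>baz</tag>", `greedy` returns
`Just "foo</tag>bar<tag>baz"` and `nonGreedy` returns `Just "foo"`.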


------------------------------

Message: 7
Date: Wed, 30 Sep 2009 18:41:35 -0500
From: Hong Yang <[email protected]>
Subject: [Haskell-beginners] subtle dissimilarity between fromInteger
        and     fromIntegral
To: [email protected]
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset="utf-8"

Hi,

Can someone explain the subtle dissimilarity between fromInteger and
fromIntegral?

Thanks,

Hong

------------------------------

Message: 8
Date: Wed, 30 Sep 2009 19:57:00 -0400
From: Joe Fredette <[email protected]>
Subject: Re: [Haskell-beginners] subtle dissimilarity between
        fromInteger and fromIntegral
To: Hong Yang <[email protected]>
Cc: Haskell Beginners <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

fromInteger takes an Integer to any instance of the Num class. The Num
class provides this function because (at least in theory) any Integer
"is" (in some sense) a value of every type that is an instance of Num.

fromIntegral has type `(Integral a, Num b) => a -> b`, which means it
takes any _integral_ type (one that supports div and mod, and a
function "toInteger" which takes an integral value to an Integer) to a
Num type. I imagine that "fromIntegral" is probably implemented as

     fromIntegral = fromInteger . toInteger
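
A short sketch of where the difference shows up in practice, using only
the Prelude. The names "half", "average" and "viaInteger" are made up
for this example.

```haskell
-- fromInteger :: Num a => Integer -> a
-- The argument must already be an Integer.
half :: Double
half = fromInteger 7 / 2

-- fromIntegral :: (Integral a, Num b) => a -> b
-- 'sum xs' and 'length xs' are Int, not Integer, so fromInteger
-- would be rejected by the type checker here.
average :: [Int] -> Double
average xs = fromIntegral (sum xs) / fromIntegral (length xs)

-- The composition guessed at above, written out with its type.
viaInteger :: (Integral a, Num b) => a -> b
viaInteger = fromInteger . toInteger
```

In GHCi, `average [1,2,3,4]` evaluates to `2.5`, and
`viaInteger (3 :: Int) :: Double` evaluates to `3.0`.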

HTH

/Joe




------------------------------

_______________________________________________
Beginners mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/beginners


End of Beginners Digest, Vol 15, Issue 21
*****************************************
