[MediaWiki-commits] [Gerrit] mediawiki...Wikispeech[master]: Add API for segmenting text

jenkins-bot (Code Review) Tue, 11 Jul 2017 08:20:07 -0700

jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/358378 )


Change subject: Add API for segmenting text
......................................................................


Add API for segmenting text

The API will initially be used to clean and segment HTML. It can also
respond with only the cleaned text or the original content, which is
useful for debugging or for use outside Wikispeech.
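
As an illustration, a request to the new module could be assembled as
below. This is only a sketch: buildWikispeechQuery() is a hypothetical
helper that is not part of this change. The module name and the
parameter names (page, output, removetags, segmentbreakingtags) are
taken from WikispeechApi.php.

```php
<?php
// Hypothetical helper, not part of this change: builds the query string
// for a request to the "wikispeech" API module.
function buildWikispeechQuery(
	$page,
	array $output,
	array $removeTags = [],
	array $segmentBreakingTags = []
) {
	$params = [
		'action' => 'wikispeech',
		'format' => 'json',
		'page' => $page,
		// Multi-value parameters are separated by "|".
		'output' => implode( '|', $output ),
	];
	if ( $removeTags ) {
		// removetags is passed as a JSON object.
		$params['removetags'] = json_encode( $removeTags );
	}
	if ( $segmentBreakingTags ) {
		$params['segmentbreakingtags'] = implode( '|', $segmentBreakingTags );
	}
	return http_build_query( $params );
}

$query = buildWikispeechQuery(
	'Main_Page',
	[ 'segments' ],
	[ 'sup' => true, 'div' => 'toc' ],
	[ 'h1', 'h2' ]
);
```

The resulting string would be appended to the wiki's api.php URL.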

Cleaner and Segmenter are now instantiated, since a few variables
were being passed through multiple functions.

The config variable for removed tags now takes a boolean or a string as
its values (and not objects), as it currently only allows a CSS class
to be specified.
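
The new value format can be illustrated with a small stand-alone
sketch. shouldRemove() is a hypothetical function, not code from this
change; it mirrors the check in Cleaner::matchesRemove() but takes a
tag name and a list of CSS classes instead of a DOMNode. Treating a
value of false as "never remove" is an assumption made for the sketch.

```php
<?php
// Hypothetical stand-alone version of the removal check. The real
// Cleaner::matchesRemove() works on DOMNode objects.
function shouldRemove( array $removeTags, $tagName, array $classes ) {
	if ( !array_key_exists( $tagName, $removeTags ) ) {
		// The tag name isn't in the removal list.
		return false;
	}
	$criteria = $removeTags[$tagName];
	if ( $criteria === true ) {
		// Always remove this tag.
		return true;
	}
	// A string value names the CSS class the tag must have.
	// Any other value (e.g. false) is assumed to mean "keep".
	return is_string( $criteria ) && in_array( $criteria, $classes, true );
}

// Same shape as the values used in CleanerTest.php.
$removeTags = [
	'table' => true,
	'sup' => 'reference',
	'h2' => false
];
```

With this configuration a table is always removed, while a sup is
removed only when it has the class "reference".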

Bug: T164250
Change-Id: I78db6c6a64e9d04e6907d2eb96af47ca368cf82e
---
M Hooks.php
M extension.json
M i18n/en.json
M i18n/qqq.json
M includes/Cleaner.php
M includes/Segmenter.php
A includes/WikispeechApi.php
M tests/phpunit/CleanerTest.php
M tests/phpunit/SegmenterTest.php
A tests/phpunit/WikispeechApiTest.php
10 files changed, 505 insertions(+), 144 deletions(-)

Approvals:
  Lokal Profil: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/Hooks.php b/Hooks.php
index 679fc47..67817b3 100644
--- a/Hooks.php
+++ b/Hooks.php
@@ -59,12 +59,19 @@
                                'Wikispeech',
                                'HTML from onParserAfterTidy(): ' . $text
                        );
-                       $cleanedContents = Cleaner::cleanHtml( $text );
+                       global $wgWikispeechRemoveTags;
+                       global $wgWikispeechSegmentBreakingTags;
+                       $cleaner = new Cleaner(
+                               $wgWikispeechRemoveTags,
+                               $wgWikispeechSegmentBreakingTags
+                       );
+                       $cleanedContent = $cleaner->cleanHtml( $text );
                        wfDebugLog(
                                'Wikispeech',
-                               'Cleaned text: ' . var_export( $cleanedContents, true )
+                               'Cleaned text: ' . var_export( $cleanedContent, true )
                        );
-                       $utterances = Segmenter::segmentSentences( $cleanedContents );
+                       $segmenter = new Segmenter();
+                       $utterances = $segmenter->segmentSentences( $cleanedContent );
                        wfDebugLog(
                                'Wikispeech',
                                'Utterances: ' . var_export( $utterances, true )
diff --git a/extension.json b/extension.json
index 0a9490a..cbcfd19 100644
--- a/extension.json
+++ b/extension.json
@@ -22,7 +22,8 @@
                "SegmentBreak": "includes/CleanedContent.php",
                "Cleaner": "includes/Cleaner.php",
                "HtmlGenerator": "includes/HtmlGenerator.php",
-               "Segmenter": "includes/Segmenter.php"
+               "Segmenter": "includes/Segmenter.php",
+               "WikispeechApi": "includes/WikispeechApi.php"
        },
        "ResourceModules": {
                "ext.wikispeech": {
@@ -76,8 +77,8 @@
                        "editsection": true,
                        "toc": true,
                        "table": true,
-                       "sup": { "class": "reference" },
-                       "div": { "class": "thumb" }
+                       "sup": "reference",
+                       "div": "thumb"
                },
                "WikispeechSegmentBreakingTags": [
                        "h1",
@@ -117,5 +118,8 @@
                "WikispeechHelpPage": "Help:Wikispeech",
                "WikispeechFeedbackPage": "Wikispeech feedback",
                "WikispeechContentWrapperClass": "mw-parser-output"
+       },
+       "APIModules": {
+               "wikispeech": "WikispeechApi"
        }
 }
diff --git a/i18n/en.json b/i18n/en.json
index 39bf852..a32135e 100644
--- a/i18n/en.json
+++ b/i18n/en.json
@@ -7,5 +7,17 @@
        "wikispeech": "Wikispeech",
        "wikispeech-desc": "A text-to-speech tool to make Wikimedia's projects more accessible for people that for different reasons have difficulties reading.",
        "special-wikispeech-intro": "Wikispeech",
-       "special-wikispeech-title": "Special page introduction."
-}
\ No newline at end of file
+       "special-wikispeech-title": "Special page introduction.",
+       "apihelp-wikispeech-description": "Process a page to be used by the Wikispeech extension.",
+       "apihelp-wikispeech-summary": "Process a page to be used by the Wikispeech extension.",
+       "apihelp-wikispeech-param-page": "The title of the page to process.",
+       "apihelp-wikispeech-param-output": "The formats of the output, can be a combination of the following:",
+       "apihelp-wikispeech-paramvalue-output-originalcontent": "The original HTML of the page. Intended as help when debugging.",
+       "apihelp-wikispeech-paramvalue-output-cleanedtext": "The page content with HTML tags removed.",
+       "apihelp-wikispeech-paramvalue-output-segments": "The page content segmented into sentences.",
+       "apihelp-wikispeech-param-removetags": "The tags that should be removed completely during cleaning, as a JSON object of the format:\n<pre>{\n  \"tagName1\": true,\n  \"tagName2\": \"cssClass\"\n}</pre>\nwhere <code>tagName1</code> is always removed and <code>tagName2</code> is only removed if it also has the CSS class <code>cssClass</code>.",
+       "apihelp-wikispeech-param-segmentbreakingtags": "The tag names for tags that should add segment breaks.",
+       "apihelp-wikispeech-example-1": "Get segments for the Wikispeech extension. The tag <code>sup</code> is removed and so is <code>div</code> when it has the class <code>toc</code>. The tags <code>h1</code> and <code>h2</code> break segments.",
+       "apihelp-wikispeech-example-2": "Get original HTML content and cleaned text.",
+       "apierror-nooutput": "Parameter <var>output</var> can't be empty."
+}
diff --git a/i18n/qqq.json b/i18n/qqq.json
index 669ae8e..69a0f23 100644
--- a/i18n/qqq.json
+++ b/i18n/qqq.json
@@ -7,5 +7,17 @@
        "wikispeech": "The name of the extension",
        "wikispeech-desc": "{{desc|name=Wikispeech|url=https://www.mediawiki.org/wiki/Extension:Wikispeech}}",
        "special-wikispeech-intro": "Description appearing on top of <tvar|1>Special:{{ll|Wikispeech}}</>.",
-       "special-wikispeech-title": "Title of the special page Special:Wikispeech"
-}
\ No newline at end of file
+       "special-wikispeech-title": "Title of the special page Special:Wikispeech",
+       "apihelp-wikispeech-description": "{{doc-apihelp-description|wikispeech}}",
+       "apihelp-wikispeech-summary": "{{doc-apihelp-summary|wikispeech}}",
+       "apihelp-wikispeech-param-page": "{{doc-apihelp-param|wikispeech|page}}",
+       "apihelp-wikispeech-param-output": "{{doc-apihelp-param|wikispeech|output}}",
+       "apihelp-wikispeech-paramvalue-output-originalcontent": "{{doc-apihelp-paramvalue|wikispeech|output|originalcontent}}",
+       "apihelp-wikispeech-paramvalue-output-cleanedtext": "{{doc-apihelp-paramvalue|wikispeech|output|cleanedtext}}",
+       "apihelp-wikispeech-paramvalue-output-segments": "{{doc-apihelp-paramvalue|wikispeech|output|segments}}",
+       "apihelp-wikispeech-param-removetags": "{{doc-apihelp-param|wikispeech|removetags}}",
+       "apihelp-wikispeech-param-segmentbreakingtags": "{{doc-apihelp-param|wikispeech|segmentbreakingtags}}",
+       "apihelp-wikispeech-example-1": "{{doc-apihelp-example|wikispeech}}",
+       "apihelp-wikispeech-example-2": "{{doc-apihelp-example|wikispeech}}",
+       "apierror-nooutput": "{{doc-apierror}}"
+}
diff --git a/includes/Cleaner.php b/includes/Cleaner.php
index 64794e3..12bfce1 100644
--- a/includes/Cleaner.php
+++ b/includes/Cleaner.php
@@ -14,6 +14,40 @@
  */
 
 class Cleaner {
+       /**
+        * An array of tags that should be removed completely during cleaning.
+        *
+        * @var array $removeTags
+        */
+
+       private $removeTags;
+
+       /**
+        * An array of tags that should add a segment break during cleaning.
+        *
+        * @var array $segmentBreakingTags
+        */
+
+       private $segmentBreakingTags;
+
+       /**
+        * An array of `CleanedText`s and `SegmentBreak`s.
+        *
+        * @var array $cleanedContent
+        */
+
+       private $cleanedContent;
+
+       function __construct( $removeTags, $segmentBreakingTags ) {
+               if ( $removeTags == null ) {
+                       $removeTags = [];
+               }
+               if ( $segmentBreakingTags == null ) {
+                       $segmentBreakingTags = [];
+               }
+               $this->removeTags = $removeTags;
+               $this->segmentBreakingTags = $segmentBreakingTags;
+       }
 
        /**
         * Clean HTML tags from a string.
@@ -23,32 +57,30 @@
         * @since 0.0.1
         * @param string $markedUpText Input text that may contain HTML
         *  tags.
-        * @return array An array of `CleanedText`s representing text nodes.
+        * @return array An array of `CleanedText`s and `SegmentBreak`s
+        *  representing text nodes.
         */
 
-       public static function cleanHtml( $markedUpText ) {
+       public function cleanHtml( $markedUpText ) {
                $dom = self::createDomDocument( $markedUpText );
                $xpath = new DOMXPath( $dom );
                // Only add elements below the dummy element. These are the
                // elements from the original HTML.
                $top = $xpath->evaluate( '/meta/dummy' )->item( 0 );
-               $cleanedContent = [];
-               self::addContent(
-                       $cleanedContent,
-                       $top
-               );
+               $this->cleanedContent = [];
+               $this->addContent( $top );
                // Remove any segment break at the start or end of the array,
                // since they won't do anything.
                if (
-                       count( $cleanedContent ) &&
-                       $cleanedContent[0] instanceof SegmentBreak
+                       count( $this->cleanedContent ) &&
+                       $this->cleanedContent[0] instanceof SegmentBreak
                ) {
-                       array_shift( $cleanedContent );
+                       array_shift( $this->cleanedContent );
                }
-               if ( self::lastElement( $cleanedContent ) instanceof SegmentBreak ) {
-                       array_pop( $cleanedContent );
+               if ( self::lastElement( $this->cleanedContent ) instanceof SegmentBreak ) {
+                       array_pop( $this->cleanedContent );
                }
-               return $cleanedContent;
+               return $this->cleanedContent;
        }
 
        /**
@@ -84,29 +116,24 @@
         * content text. Adds segment breaks for appropriate tags.
         *
         * @since 0.0.1
-        * @param array $content The resulting array of `CleanedText`s
-        *  and `SegmentBreak`s.
         * @param DOMNode $node The top node to add from.
         */
 
-       private static function addContent(
-               &$content,
-               $node
-       ) {
-               global $wgWikispeechSegmentBreakingTags;
-               if ( !self::matchesRemove( $node ) ) {
+       private function addContent( $node ) {
+               if ( !$node instanceof DOMComment && !$this->matchesRemove( $node ) ) {
                        foreach ( $node->childNodes as $child ) {
                                if (
-                                       !self::lastElement( $content ) instanceof SegmentBreak &&
+                                       !self::lastElement( $this->cleanedContent )
+                                               instanceof SegmentBreak &&
                                        in_array(
                                                $child->nodeName,
-                                               $wgWikispeechSegmentBreakingTags
+                                               $this->segmentBreakingTags
                                        )
                                ) {
                                        // Add segment breaks for start tags specified in
-                                       // the config, unless the previous item is a
-                                       // break.
-                                       array_push( $content, new SegmentBreak() );
+                                       // the config, unless the previous item is a break
+                                       // or this is the first item.
+                                       array_push( $this->cleanedContent, new SegmentBreak() );
                                }
                                if ( $child->nodeType == XML_TEXT_NODE ) {
                                        // Remove the path to the dummy node and instead
@@ -117,23 +144,20 @@
                                                $child->getNodePath()
                                        );
                                        $text = new CleanedText( $child->textContent, $path );
-                                       array_push( $content, $text );
+                                       array_push( $this->cleanedContent, $text );
                                } else {
-                                       self::addContent(
-                                               $content,
-                                               $child
-                                       );
+                                       $this->addContent( $child );
                                }
                                if (
-                                       !self::lastElement( $content ) instanceof SegmentBreak &&
+                                       !self::lastElement( $this->cleanedContent ) instanceof SegmentBreak &&
                                        in_array(
                                                $child->nodeName,
-                                               $wgWikispeechSegmentBreakingTags
+                                               $this->segmentBreakingTags
                                        )
                                ) {
                                        // Add segment breaks for end tags specified in
                                        // the config.
-                                       array_push( $content, new SegmentBreak() );
+                                       array_push( $this->cleanedContent, new SegmentBreak() );
                                }
                        }
                }
@@ -151,18 +175,17 @@
         *  false.
         */
 
-       private static function matchesRemove( $node ) {
-               global $wgWikispeechRemoveTags;
-               if ( !array_key_exists( $node->nodeName, $wgWikispeechRemoveTags ) ) {
+       private function matchesRemove( $node ) {
+               if ( !array_key_exists( $node->nodeName, $this->removeTags ) ) {
                        // The node name isn't found in the removal list.
                        return false;
                }
-               $removeCriteria = $wgWikispeechRemoveTags[$node->nodeName];
+               $removeCriteria = $this->removeTags[$node->nodeName];
                if ( $removeCriteria === true ) {
                        // Node name is found and there are no extra criteria.
                        return true;
                }
-               if ( self::nodeHasClass( $node, $removeCriteria['class'] ) ) {
+               if ( self::nodeHasClass( $node, $removeCriteria ) ) {
                        // Node name and class name match.
                        return true;
                }
diff --git a/includes/Segmenter.php b/includes/Segmenter.php
index 9a4bb4f..d5ec537 100644
--- a/includes/Segmenter.php
+++ b/includes/Segmenter.php
@@ -17,18 +17,43 @@
 class Segmenter {
 
        /**
+        * An array to which finished segments are added.
+        *
+        * @var array $segments
+        */
+
+       private $segments;
+
+       /**
+        * The segment that is currently being built.
+        *
+        * @var array $currentSegment
+        */
+
+       private $currentSegment;
+
+       function __construct() {
+               $this->segments = [];
+               $this->currentSegment = [
+                       'content' => [],
+                       'startOffset' => null,
+                       'endOffset' => null
+               ];
+       }
+
+       /**
         * Divide a cleaned content array into segments, one for each sentence.
         *
         * A segment is an array with the keys "content", "startOffset"
-        * and "endOffset". "content" is an array of `CleanedText`s.
-        * "startOffset" is the offset of the first character of the
-        * segment, within the text node it appears. "endOffset" is the
-        * offset of the last character of the segment, within the text
-        * node it appears. These are used to determine start and end of a
-        * segment in the original HTML.
+        * and "endOffset". "content" is an array of `CleanedText`s and
+        * `SegmentBreak`s. "startOffset" is the offset of the first
+        * character of the segment, within the text node it
+        * appears. "endOffset" is the offset of the last character of the
+        * segment, within the text node it appears. These are used to
+        * determine start and end of a segment in the original HTML.
         *
-        * A sentence is here defined as a number of tokens ending with a
-        * dot (full stop).
+        * A sentence is here defined as a sequence of tokens ending with
+        * a dot (full stop).
         *
         * @since 0.0.1
         * @param array $cleanedContent An array of items returned by
@@ -37,29 +62,19 @@
         *  `CleanedText's in that segment.
         */
 
-       public static function segmentSentences( $cleanedContent ) {
-               $segments = [];
-               $currentSegment = [
-                       'content' => [],
-                       'startOffset' => null,
-                       'endOffset' => null
-               ];
+       public function segmentSentences( $cleanedContent ) {
                foreach ( $cleanedContent as $item ) {
                        if ( $item instanceof CleanedText ) {
-                               self::addSegments(
-                                       $segments,
-                                       $currentSegment,
-                                       $item
-                               );
+                               $this->addSegments( $item );
                        } elseif ( $item instanceof SegmentBreak ) {
-                               self::finishSegment( $segments, $currentSegment );
+                               $this->finishSegment();
                        }
                }
-               if ( $currentSegment['content'] ) {
+               if ( $this->currentSegment['content'] ) {
                        // Add the last segment, unless it's empty.
-                       self::finishSegment( $segments, $currentSegment );
+                       $this->finishSegment();
                }
-               return $segments;
+               return $this->segments;
        }
 
        /**
@@ -70,24 +85,13 @@
         * added to the $currentSegment.
         *
         * @since 0.0.1
-        * @param array $segments The segment array to add new segments to.
-        * @param array $currentSegment The segment under construction.
         * @param CleanedText $text The text to segment.
         */
 
-       private static function addSegments(
-               &$segments,
-               &$currentSegment,
-               $text
-       ) {
+       private function addSegments( $text ) {
                $nextStartOffset = 0;
                do {
-                       $endOffset = self::addSegment(
-                               $segments,
-                               $currentSegment,
-                               $text,
-                               $nextStartOffset
-                       );
+                       $endOffset = $this->addSegment( $text, $nextStartOffset );
                        // The earliest the next segments can start is one after
                        // the end of the current one.
                        $nextStartOffset = $endOffset + 1;
@@ -104,24 +108,14 @@
         * offset when the last is.
         *
         * @since 0.0.1
-        * @param array $segments The segment array to add new segments to.
-        * @param array $currentSegment The segment under construction.
         * @param CleanedText $text The text to segment.
         * @param int $startOffset The offset where the next sentence can
         *  start, at the earliest. If the sentence has leading
         *  whitespaces, this will be moved forward.
-        * @return int The offset of the last character in the
-        *   sentence. If the sentence didn't end yet, this is the last
-        *   character of $text.
         */
 
-       private static function addSegment(
-               &$segments,
-               &$currentSegment,
-               $text,
-               $startOffset=0
-       ) {
-               if ( $currentSegment['startOffset'] === null ) {
+       private function addSegment( $text, $startOffset=0 ) {
+               if ( $this->currentSegment['startOffset'] === null ) {
                        // Move the start offset ahead by the number of leading
                        // whitespaces. This means that whitespaces before or
                        // between segments aren't included.
@@ -147,21 +141,22 @@
                        $startOffset,
                        $endOffset - $startOffset + 1
                );
-               if ( $sentence !== '' ) {
-                       // Don't add `CleanedText`s with the empty string.
+               if ( $sentence !== '' && $sentence !== "\n" ) {
+                       // Don't add `CleanedText`s with the empty string or only
+                       // newline.
                        $sentenceText = new CleanedText(
                                $sentence,
                                $text->path
                        );
-                       array_push( $currentSegment['content'], $sentenceText );
-                       if ( $currentSegment['startOffset'] === null ) {
+                       array_push( $this->currentSegment['content'], $sentenceText );
+                       if ( $this->currentSegment['startOffset'] === null ) {
                                // Record the start offset if this is the first text
                                // added to the segment.
-                               $currentSegment['startOffset'] = $startOffset;
+                               $this->currentSegment['startOffset'] = $startOffset;
                        }
-                       $currentSegment['endOffset'] = $endOffset;
+                       $this->currentSegment['endOffset'] = $endOffset;
                        if ( $ended ) {
-                               self::finishSegment( $segments, $currentSegment );
+                               $this->finishSegment();
                        }
                }
                return $endOffset;
@@ -271,12 +266,12 @@
         * @param array $currentCegments The finished segment to add.
         */
 
-       private static function finishSegment( &$segments, &$currentSegment ) {
-               if ( count( $currentSegment['content'] ) ) {
-                       array_push( $segments, $currentSegment );
+       private function finishSegment() {
+               if ( count( $this->currentSegment['content'] ) ) {
+                       array_push( $this->segments, $this->currentSegment );
                }
                // Create a fresh segment to add following text to.
-               $currentSegment = [
+               $this->currentSegment = [
                        'content' => [],
                        'startOffset' => null,
                        'endOffset' => null
diff --git a/includes/WikispeechApi.php b/includes/WikispeechApi.php
new file mode 100644
index 0000000..d54c432
--- /dev/null
+++ b/includes/WikispeechApi.php
@@ -0,0 +1,180 @@
+<?php
+
+/**
+ * @file
+ * @ingroup Extensions
+ * @license GPL-2.0+
+ */
+
+class WikispeechApi extends ApiBase {
+
+       /**
+        * Execute an API request.
+        *
+        * @since 0.0.1
+        */
+
+       function execute() {
+               if ( $this->getMain()->getVal( 'output' ) == '' ) {
+                       $this->dieWithError( 'apierror-nooutput' );
+               }
+               $outputFormats = $this->parseMultiValue(
+                       'output',
+                       $this->getMain()->getVal( 'output' ),
+                       true,
+                       [ 'originalcontent', 'cleanedtext', 'segments' ]
+               );
+               $pageTitle = $this->getMain()->getVal( 'page' );
+               $pageContent = $this->getPageContent( $pageTitle );
+               $removeTags = json_decode(
+                       $this->getMain()->getVal( 'removetags' ),
+                       true
+               );
+               $segmentBreakingTags = $this->parseMultiValue(
+                       'segmentbreakingtags',
+                       $this->getMain()->getVal( 'segmentbreakingtags' ),
+                       true,
+                       null
+               );
+               $this->processPageContent(
+                       $pageContent,
+                       $outputFormats,
+                       $removeTags,
+                       $segmentBreakingTags
+               );
+       }
+
+       /**
+        * Process HTML and return it as original, cleaned and/or segmented.
+        *
+        * @since 0.0.1
+        * @param string $pageContent The HTML string to process.
+        * @param array $outputFormats Specifies what output formats to
+        *  return. Can be any combination of: "originalcontent",
+        *  "cleanedtext" and "segments".
+        * @param array $removeTags Used by `Cleaner` to remove tags.
+        * @param array $segmentBreakingTags Used by `Segmenter` to break
+        *  segments.
+        * @return array An array containing the output from the processes
+        *  specified by $outputFormats:
+        *  * "originalcontent": The input HTML string.
+        *  * "cleanedtext": The cleaned HTML, as a string.
+        *  * "segments": Cleaned and segmented HTML as an array.
+        */
+
+       public function processPageContent(
+               $pageContent,
+               $outputFormats,
+               $removeTags,
+               $segmentBreakingTags
+       ) {
+               $values = [];
+               if ( in_array( 'originalcontent', $outputFormats ) ) {
+                       $values['originalcontent'] = $pageContent;
+               }
+               $cleaner = new Cleaner( $removeTags, $segmentBreakingTags );
+               $cleanedText = null;
+               if ( in_array( 'cleanedtext', $outputFormats ) ) {
+                       $cleanedText = $cleaner->cleanHtml( $pageContent );
+                       // Make a string of all the cleaned text.
+                       $cleanedTextString = '';
+                       foreach ( $cleanedText as $item ) {
+                               if ( $item instanceof SegmentBreak ) {
+                                       $cleanedTextString .= "\n";
+                               } elseif ( $item->string != "\n" ) {
+                                       // Don't add text that is only newline.
+                                       $cleanedTextString .= $item->string;
+                               }
+                       }
+                       $values['cleanedtext'] = trim( $cleanedTextString );
+               }
+               $segmenter = new Segmenter();
+               if ( in_array( 'segments', $outputFormats ) ) {
+                       if ( $cleanedText == null ) {
+                               $cleanedText = $cleaner->cleanHtml( $pageContent );
+                       }
+                       $segments = $segmenter->segmentSentences( $cleanedText );
+                       $values['segments'] = $segments;
+               }
+               $this->getResult()->addValue(
+                       null,
+                       $this->getModuleName(),
+                       $values
+               );
+       }
+
+       /**
+        * Request the parsed content from the main API.
+        *
+        * @since 0.0.1
+        * @param string $pageTitle The title of the page to get content
+        *  from.
+        * @return string The parsed content for the page given in the
+        *  request to the Wikispeech API.
+        */
+
+       private function getPageContent( $pageTitle ) {
+               $request = new FauxRequest( [
+                       'action' => 'parse',
+                       'page' => $pageTitle
+               ] );
+               $api = new ApiMain( $request );
+               $api->execute();
+               $pageContent = $api->getResult()->getResultData( [ 'parse', 'text' ] );
+               return $pageContent;
+       }
+
+       /**
+        * Specify what parameters the API accepts.
+        *
+        * @since 0.0.1
+        * @return array
+        */
+
+       public function getAllowedParams() {
+               return array_merge(
+                       parent::getAllowedParams(),
+                       [
+                               'page' => [
+                                       ApiBase::PARAM_TYPE => 'string',
+                                       ApiBase::PARAM_REQUIRED => true
+                               ],
+                               'output' => [
+                                       ApiBase::PARAM_TYPE => [
+                                               'originalcontent',
+                                               'cleanedtext',
+                                               'segments'
+                                       ],
+                                       ApiBase::PARAM_REQUIRED => true,
+                                       ApiBase::PARAM_ISMULTI => true,
+                                       ApiBase::PARAM_HELP_MSG_PER_VALUE => []
+                               ],
+                               'removetags' => [
+                                       ApiBase::PARAM_TYPE => 'string'
+                               ],
+                               'segmentbreakingtags' => [
+                                       ApiBase::PARAM_TYPE => 'string',
+                                       ApiBase::PARAM_ISMULTI => true
+                               ]
+                       ]
+               );
+       }
+
+       /**
+        * Give examples of usage.
+        *
+        * @since 0.0.1
+        * @return array
+        */
+
+       public function getExamplesMessages() {
+               return [
+               // @codingStandardsIgnoreStart
+                       'action=wikispeech&format=json&page=Main_Page&output=segments&removetags={"sup": true, "div": "toc"}&segmentbreakingtags=h1|h2'
+               // @codingStandardsIgnoreEnd
+                       => 'apihelp-wikispeech-example-1',
+                       'action=wikispeech&format=json&page=Main_Page&output=originalcontent|cleanedtext'
+                       => 'apihelp-wikispeech-example-2',
+               ];
+       }
+}
diff --git a/tests/phpunit/CleanerTest.php b/tests/phpunit/CleanerTest.php
index 5696932..37f0f81 100644
--- a/tests/phpunit/CleanerTest.php
+++ b/tests/phpunit/CleanerTest.php
@@ -12,19 +12,21 @@
 class CleanerTest extends MediaWikiTestCase {
        protected function setUp() {
                parent::setUp();
-               global $wgWikispeechRemoveTags;
-               $wgWikispeechRemoveTags = [
+               $removeTags = [
                        'table' => true,
-                       'sup' => [ 'class' => 'reference' ],
+                       'sup' => 'reference',
                        'editsection' => true,
                        'h2' => false,
                        'del' => true
                ];
-               global $wgWikispeechSegmentBreakingTags;
-               $wgWikispeechSegmentBreakingTags = [
+               $segmentBreakingTags = [
                        'hr',
                        'a'
                ];
+               $this->cleaner = new Cleaner(
+                       $removeTags,
+                       $segmentBreakingTags
+               );
        }
 
        public function testCleanTags() {
@@ -60,7 +62,7 @@
        ) {
                $this->assertContentsEqual(
                        $expectedCleanedContents,
-                       Cleaner::cleanHtml( $markedUpText )
+                       $this->cleaner->cleanHtml( $markedUpText )
                );
                $this->assertWithPrefixAndSuffix(
                        $expectedCleanedContents,
@@ -149,7 +151,7 @@
                }
                $this->assertContentsEqual(
                        $expectedCleanedContents,
-                       Cleaner::cleanHtml( 'prefix' . $markedUpText . 'suffix' )
+                       $this->cleaner->cleanHtml( 'prefix' . $markedUpText . 'suffix' )
                );
        }
 
@@ -187,7 +189,7 @@
                }
                $this->assertContentsEqual(
                        array_merge( $firstContents, [ $infix ], $secondContents ),
-                       Cleaner::cleanHtml( $markedUpText . 'infix' . $markedUpText )
+                       $this->cleaner->cleanHtml( $markedUpText . 'infix' . $markedUpText )
                );
        }
 
@@ -198,7 +200,7 @@
                ];
                $this->assertContentsEqual(
                        $expectedCleanedContent,
-                       Cleaner::cleanHtml( $markedUpText )
+                       $this->cleaner->cleanHtml( $markedUpText )
                );
        }
 
@@ -265,7 +267,7 @@
                ];
                $this->assertContentsEqual(
                        $expectedCleanedContents,
-                       Cleaner::cleanHtml( $markedUpText )
+                       $this->cleaner->cleanHtml( $markedUpText )
                );
        }
 
@@ -279,7 +281,7 @@
                ];
                $this->assertContentsEqual(
                        $expectedCleanedContents,
-                       Cleaner::cleanHtml( $markedUpText )
+                       $this->cleaner->cleanHtml( $markedUpText )
                );
        }
 
@@ -293,7 +295,7 @@
                ];
                $this->assertContentsEqual(
                        $expectedCleanedContents,
-                       Cleaner::cleanHtml( $markedUpText )
+                       $this->cleaner->cleanHtml( $markedUpText )
                );
        }
 
@@ -307,7 +309,7 @@
                ];
                $this->assertContentsEqual(
                        $expectedCleanedContents,
-                       Cleaner::cleanHtml( $markedUpText )
+                       $this->cleaner->cleanHtml( $markedUpText )
                );
        }
 
@@ -321,7 +323,7 @@
                ];
                $this->assertContentsEqual(
                        $expectedCleanedContents,
-                       Cleaner::cleanHtml( $markedUpText )
+                       $this->cleaner->cleanHtml( $markedUpText )
                );
        }
 
@@ -333,7 +335,7 @@
                ];
                $this->assertContentsEqual(
                        $expectedCleanedContents,
-                       Cleaner::cleanHtml( $markedUpText )
+                       $this->cleaner->cleanHtml( $markedUpText )
                );
        }
 
@@ -387,6 +389,12 @@
                $this->assertTextCleaned( $expectedCleanedContent, $markedUpText );
        }
 
+       public function testIgnoreComments() {
+               $markedUpText = '<!-- A comment. -->';
+               $expectedCleanedContent = [];
+               $this->assertTextCleaned( $expectedCleanedContent, $markedUpText );
+       }
+
        public function testGeneratePaths() {
                $markedUpText = '<i>level one<br /><b>level two</b></i>level zero';
                $expectedCleanedContent = [
@@ -396,7 +404,7 @@
                ];
                $this->assertEquals(
                        $expectedCleanedContent,
-                       Cleaner::cleanHtml( $markedUpText )
+                       $this->cleaner->cleanHtml( $markedUpText )
                );
        }
 
@@ -408,7 +416,7 @@
                ];
                $this->assertEquals(
                        $expectedCleanedContent,
-                       Cleaner::cleanHtml( $markedUpText )
+                       $this->cleaner->cleanHtml( $markedUpText )
                );
        }
 
@@ -420,7 +428,7 @@
                ];
                $this->assertEquals(
                        $expectedCleanedContent,
-                       Cleaner::cleanHtml( $markedUpText )
+                       $this->cleaner->cleanHtml( $markedUpText )
                );
        }
 }
diff --git a/tests/phpunit/SegmenterTest.php b/tests/phpunit/SegmenterTest.php
index 18b5b7e..6aa40a4 100644
--- a/tests/phpunit/SegmenterTest.php
+++ b/tests/phpunit/SegmenterTest.php
@@ -10,6 +10,11 @@
 require_once 'Util.php';
 
 class SegmenterTest extends MediaWikiTestCase {
+       protected function setUp() {
+               parent::setUp();
+               $this->segmenter = new Segmenter();
+       }
+
        public function testSegmentSentences() {
                $cleanedContent = [
                        new CleanedText( 'Sentence 1. Sentence 2. Sentence 3.' )
@@ -31,7 +36,7 @@
                                'content' => [ new CleanedText( 'Sentence 3.' ) ]
                        ]
                ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals( $expectedSegments, $segments );
        }
 
@@ -39,7 +44,7 @@
                $cleanedContent = [
                        new CleanedText( 'This is... one sentence.' )
                        ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        [ new CleanedText( 'This is... one sentence.' ) ],
                        $segments[0]['content']
@@ -48,7 +53,7 @@
 
        public function testDontSegmentByAbbreviations() {
                $cleanedContent = [ new CleanedText( 'One sentence i.e. one segment.' ) ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        [ new CleanedText( 'One sentence i.e. one segment.' ) ],
                        $segments[0]['content']
@@ -57,7 +62,7 @@
 
        public function testDontSegmentByDotDirectlyFollowedByComma() {
                $cleanedContent = [ new CleanedText( 'As with etc., jr. and friends.' ) ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        [ new CleanedText( 'As with etc., jr. and friends.' ) ],
                        $segments[0]['content']
@@ -66,7 +71,7 @@
 
        public function testDontSegmentByDecimalDot() {
                $cleanedContent = [ new CleanedText( 'In numbers like 2.9.' ) ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        [ new CleanedText( 'In numbers like 2.9.' ) ],
                        $segments[0]['content']
@@ -75,7 +80,7 @@
 
        public function testKeepLastSegmentEvenIfNotEndingWithSentenceFinalCharacter() {
                $cleanedContent = [ new CleanedText( 'Sentence. No sentence final' ) ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        [ new CleanedText( 'No sentence final' ) ],
                        $segments[1]['content']
@@ -89,7 +94,7 @@
                        new CleanedText( 'by' ),
                        new CleanedText( ' tags.' )
                ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        0,
                        $segments[0]['startOffset']
@@ -117,7 +122,7 @@
                        new CleanedText( 'First sentence. Split' ),
                        new CleanedText( 'sentence. And other sentence.' ),
                ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        16,
                        $segments[1]['startOffset']
@@ -130,7 +135,7 @@
 
        public function testTextOffset() {
                $cleanedContent = [ new CleanedText( 'Sentence.' ) ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals( 0, $segments[0]['startOffset'] );
                $this->assertEquals( 8, $segments[0]['endOffset'] );
        }
@@ -141,7 +146,7 @@
                                'Normal sentence. Utterance with å. Another normal sentence.'
                        )
                ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        'Utterance with å.',
                        $segments[1]['content'][0]->string
@@ -161,7 +166,7 @@
                        new CleanedText( 'Sentence one' ),
                        new CleanedText( '. Sentence two.' )
                ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        'Sentence one',
                        $segments[0]['content'][0]->string
@@ -181,7 +186,7 @@
                        new CleanedText( 'Sentence 1. ' ),
                        new CleanedText( 'Sentence 2.' )
                ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        'Sentence 1.',
                        $segments[0]['content'][0]->string
@@ -200,7 +205,7 @@
                        new SegmentBreak(),
                        new CleanedText( 'Text two' )
                ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        'Text one',
                        $segments[0]['content'][0]->string
@@ -216,7 +221,7 @@
                        new CleanedText( ' ' ),
                        new CleanedText( 'Sentence 1.' )
                ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        'Sentence 1.',
                        $segments[0]['content'][0]->string
@@ -225,7 +230,7 @@
 
        public function testRemoveLeadingAndTrailingWhitespaces() {
                $cleanedContent = [ new CleanedText( ' Sentence. ' ) ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        'Sentence.',
                        $segments[0]['content'][0]->string
@@ -234,13 +239,29 @@
                $this->assertEquals( 9, $segments[0]['endOffset'] );
        }
 
+       public function testDontAddOnlyNewlineItem() {
+               $cleanedContent = [
+                       new CleanedText( 'text' ),
+                       new CleanedText( "\n" )
+               ];
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
+               $this->assertEquals(
+                       1,
+                       count( $segments[0]['content'] )
+               );
+               $this->assertEquals(
+                       'text',
+                       $segments[0]['content'][0]->string
+               );
+       }
+
        public function testLastTextIsOnlySentenceFinalCharacter() {
                $cleanedContent = [
                        new CleanedText( 'Sentence one' ),
                        new CleanedText( '. ' ),
                        new CleanedText( 'Sentence two.' )
                ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        'Sentence one',
                        $segments[0]['content'][0]->string
@@ -263,7 +284,7 @@
                        new SegmentBreak(),
                        new CleanedText( 'Paragraph two' )
                ];
-               $segments = Segmenter::segmentSentences( $cleanedContent );
+               $segments = $this->segmenter->segmentSentences( $cleanedContent );
                $this->assertEquals(
                        'Header',
                        $segments[0]['content'][0]->string
diff --git a/tests/phpunit/WikispeechApiTest.php b/tests/phpunit/WikispeechApiTest.php
new file mode 100644
index 0000000..41da705
--- /dev/null
+++ b/tests/phpunit/WikispeechApiTest.php
@@ -0,0 +1,99 @@
+<?php
+
+/**
+ * @file
+ * @ingroup Extensions
+ * @group Database
+ * @group medium
+ * @license GPL-2.0+
+ */
+
+require_once __DIR__ . '/../../includes/WikispeechApi.php';
+require_once 'Util.php';
+
+define( 'TITLE', 'Talk:Page' );
+
+class WikispeechApiTest extends ApiTestCase {
+       public function addDBDataOnce() {
+               $content = "Text ''italic'' '''bold'''";
+               $this->addPage( TITLE, $content );
+       }
+
+       private function addPage( $titleString, $content ) {
+               $title = Title::newFromText( $titleString );
+               $page = WikiPage::factory( $title );
+               $status = $page->doEditContent(
+                       ContentHandler::makeContent(
+                               $content,
+                               $title,
+                               CONTENT_MODEL_WIKITEXT
+                       ),
+                       ''
+               );
+               if ( !$status->isOk() ) {
+                       $this->fail( "Failed to create $title: " . $status->getWikiText( false, false, 'en' ) );
+               }
+       }
+
+       public function testCleanText() {
+               $res = $this->doApiRequest( [
+                       'action' => 'wikispeech',
+                       'page' => TITLE,
+                       'output' => 'cleanedtext'
+               ] );
+               $this->assertEquals(
+                       'Text italic bold',
+                       $res[0]['wikispeech']['cleanedtext']
+               );
+       }
+
+       public function testCleanTextHandleSegmentBreaks() {
+               $title = 'Talk:Break';
+               $content = 'Text with<br/ >break.';
+               $this->addPage( $title, $content );
+               $res = $this->doApiRequest( [
+                       'action' => 'wikispeech',
+                       'page' => $title,
+                       'output' => 'cleanedtext',
+                       'segmentbreakingtags' => 'br'
+               ] );
+               $this->assertEquals(
+                       "Text with\nbreak.",
+                       $res[0]['wikispeech']['cleanedtext']
+               );
+       }
+
+       public function testSegmentText() {
+               $res = $this->doApiRequest( [
+                       'action' => 'wikispeech',
+                       'page' => TITLE,
+                       'output' => 'segments'
+               ] );
+               $this->assertEquals( 1, count( $res[0]['wikispeech']['segments'] ) );
+               $this->assertEquals(
+                       [
+                               'startOffset' => 0,
+                               'endOffset' => 3,
+                               'content' => [
+                                       [
+                                               'string' => 'Text ',
+                                               'path' => './div/p/text()[1]'
+                                       ],
+                                       [
+                                               'string' => 'italic',
+                                               'path' => './div/p/i/text()'
+                                       ],
+                                       [
+                                               'string' => ' ',
+                                               'path' => './div/p/text()[2]'
+                                       ],
+                                       [
+                                               'string' => 'bold',
+                                               'path' => './div/p/b/text()'
+                                       ]
+                               ]
+                       ],
+                       $res[0]['wikispeech']['segments'][0]
+               );
+       }
+}

-- 
To view, visit https://gerrit.wikimedia.org/r/358378
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I78db6c6a64e9d04e6907d2eb96af47ca368cf82e
Gerrit-PatchSet: 11
Gerrit-Project: mediawiki/extensions/Wikispeech
Gerrit-Branch: master
Gerrit-Owner: Sebastian Berlin (WMSE) <[email protected]>
Gerrit-Reviewer: Anomie <[email protected]>
Gerrit-Reviewer: Lokal Profil <[email protected]>
Gerrit-Reviewer: Sebastian Berlin (WMSE) <[email protected]>
Gerrit-Reviewer: Siebrand <[email protected]>
Gerrit-Reviewer: jenkins-bot <>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
